Product purchases - if consumer buys product x how likely are they to buy product y
I have some data on store-level purchases; a random sample of the data looks like this:
      PANID       WEEK    L1
966 3357632 2011-02-21 PIZZA
352 3357632 2009-11-09 SALTY
68  3357632 2012-06-18  BEER
65  3357632 2012-03-05  BEER
43  3108696 2011-10-31  BEER
672 3144766 2010-03-29 SALTY
70  3357632 2012-06-18  BEER
810 3144766 2012-06-18 SALTY
546 3144766 2008-05-05 SALTY
933 3357632 2009-06-15 PIZZA
(EDIT: This random sample contains 2012 rows; the data I provided was filtered to 2007-2010 because of the character limit.)
There are 3 PANIDs across 4 years' worth of data for 3 categories of products (BEER, SALTY, PIZZA). I am trying to find out whether people who bought BEER in a given week also bought SALTY, i.e. to construct a conditional probability: given that PANID 3144766 bought SALTY in WEEK 2009-06-15, what is the probability that they will also buy BEER? I want to do the same for PIZZA, i.e. given that they bought BEER, the probability of them buying PIZZA, and finally, given that they bought PIZZA, the probability of them buying SALTY.
For example, the person below bought 3 packets of SALTY and one unit of BEER, but in a different week they might have bought just BEER and PIZZA, or just SALTY.
> data %>%
+ group_by(PANID) %>%
+ filter(WEEK == "2009-06-15") %>%
+ filter(PANID == "3144766")
# A tibble: 6 x 3
# Groups:   PANID [1]
    PANID WEEK       L1
    <int> <date>     <chr>
1 3144766 2009-06-15 BEER
2 3144766 2009-06-15 SALTY
3 3144766 2009-06-15 SALTY
4 3144766 2009-06-15 SALTY
5 3144766 2009-06-15 PIZZA
6 3144766 2009-06-15 PIZZA
Data:
EDIT1: Data removed due to character limit. Can be found here: https://textuploader.com/db1kf
EDIT2:
I ran the following code and got the output below:
#Probability of buying BEER or SALTY if PIZZA was bought
dat %>%
group_by(PIZZA > 0) %>%
summarise(beer = sum(BEER > 0) / n(), nobeer = sum(BEER == 0) / n(),
salty = sum(SALTY > 0) / n(), nosalty = sum(SALTY == 0) / n())
#Probability of buying SALTY or PIZZA if BEER was bought
dat %>%
group_by(BEER > 0) %>%
summarise(pizza = sum(PIZZA > 0) / n(), nopizza = sum(PIZZA == 0) / n(),
salty = sum(SALTY > 0) / n(), nosalty = sum(SALTY == 0) / n())
#Probability of buying BEER or PIZZA if SALTY was bought
dat %>%
group_by(SALTY > 0) %>%
summarise(pizza = sum(PIZZA > 0) / n(), nopizza = sum(PIZZA == 0) / n(),
beer = sum(BEER > 0) / n(), nobeer = sum(BEER == 0) / n())
Output:
1)
# A tibble: 2 x 5
  `PIZZA > 0`  beer nobeer salty nosalty
  <lgl>       <dbl>  <dbl> <dbl>   <dbl>
1 FALSE       0.333  0.667 0.833   0.167
2 TRUE        0.257  0.743 0.586   0.414
2)
# A tibble: 2 x 5
  `BEER > 0` pizza nopizza salty nosalty
  <lgl>      <dbl>   <dbl> <dbl>   <dbl>
1 FALSE      0.371   0.629 0.843   0.157
2 TRUE       0.290   0.710 0.532   0.468
3)
# A tibble: 2 x 5
  `SALTY > 0` pizza nopizza  beer nobeer
  <lgl>       <dbl>   <dbl> <dbl>  <dbl>
1 FALSE       0.569   0.431 0.569  0.431
2 TRUE        0.272   0.728 0.219  0.781
Just to check that my understanding is correct: if I buy PIZZA, I have a 0.586 probability of buying SALTY and a 0.414 probability of not buying SALTY (table 1). However, if I buy SALTY, I have a 0.272 probability of buying PIZZA and a 0.728 probability of not buying PIZZA (table 3)?
This seems to be more a question about statistical analysis (what you need to do to calculate the probability) than a question about R (how to code R to do it).
– iod
Nov 13 '18 at 21:02
Yes, that is what I am trying to do: calculate the conditional probability that, given person X bought product BEER, the same person (on the same shopping trip, WEEK) will also put product SALTY in their basket. This would suggest that beer and salty products are complements and should have a higher probability of being combined than other products in the dataset, e.g. DIAPERS.
– user113156
Nov 13 '18 at 21:06
I'm just saying this is a statistics question, not an R question. Anyway, a hint for the R part: df %>% group_by(PANID, year, WEEK,L1) %>% summarize(n=n()) %>% tidyr::spread(L1, n)
– iod
Nov 13 '18 at 21:09
That's quite helpful, thanks! Does this give me the total purchases each consumer made for each product in a given WEEK?
– user113156
Nov 13 '18 at 21:19
That is correct.
– iod
Nov 13 '18 at 21:22
asked Nov 13 '18 at 20:53, edited Nov 14 '18 at 16:09
user113156
1 Answer
I'm not 100% sure this is what you're looking for, so let me know if I'm off track.
We start with what I suggested in the comment (slightly adjusted to replace the NAs with 0):
df <- df %>% group_by(PANID, year, WEEK, L1) %>% summarize(n = n()) %>% tidyr::spread(L1, n, fill = 0)
This gives us a wide-data format where for each person-week, we see the number of purchases of each of the three food types, e.g.:
> head(df,3)
# A tibble: 3 x 6
# Groups:   PANID, year, WEEK [3]
    PANID  year WEEK        BEER PIZZA SALTY
    <int> <int> <date>     <dbl> <dbl> <dbl>
1 3108696  2007 2007-12-31     2     4     6
2 3108696  2008 2008-01-21     0     2     2
3 3108696  2008 2008-02-04     1     0     2
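As an aside, `spread()` still works but has been superseded in newer tidyr releases by `pivot_wider()`. Here is a minimal sketch of the same reshaping, assuming tidyr 1.1 or later. The `purchases` data frame and its rows are made up for illustration (they are not the asker's data), and the `year` column is left out for brevity.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format purchases: one row per item bought.
purchases <- data.frame(
  PANID = c(1L, 1L, 1L, 1L, 2L, 2L),
  WEEK  = as.Date(c("2009-06-15", "2009-06-15", "2009-06-15",
                    "2009-06-22", "2009-06-15", "2009-06-15")),
  L1    = c("BEER", "SALTY", "SALTY", "PIZZA", "PIZZA", "PIZZA"),
  stringsAsFactors = FALSE
)

# Count items per person-week-category, then spread the categories into
# columns; person-weeks without a given category get 0 instead of NA.
wide <- purchases %>%
  count(PANID, WEEK, L1) %>%
  pivot_wider(names_from = L1, values_from = n, values_fill = 0L)

print(wide)
```

Each row of `wide` is one person-week, which is the unit the probabilities in the rest of this answer are computed over.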
Now we can create a table that gives the probability of buying BEER or SALTY (in any amount), depending on whether PIZZA (in any amount) was purchased in the same week:
df %>% group_by(PIZZA>0) %>%
summarise(beer=sum(BEER>0)/n(),nobeer=sum(BEER==0)/n(),
salty=sum(SALTY>0)/n(),nosalty=sum(SALTY==0)/n())
Result:
# A tibble: 2 x 5
  `PIZZA > 0`  beer nobeer salty nosalty
  <lgl>       <dbl>  <dbl> <dbl>   <dbl>
1 FALSE       0.333  0.667 0.833   0.167
2 TRUE        0.257  0.743 0.586   0.414
So we can see that in weeks when PIZZA was purchased, the share of weeks with BEER (0.257 vs. 0.333) and with SALTY (0.586 vs. 0.833) is lower than in weeks when PIZZA was not purchased.
The same can be done for BEER and SALTY, of course.
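If repeating the block for every category gets tedious, one possible shortcut is to compute all pairwise conditional probabilities at once from a co-occurrence matrix. This is only a sketch in base R on made-up data: the data frame `wide` below is a hypothetical stand-in for the wide table, with one row per person-week.

```r
# Hypothetical wide data: one row per person-week, item counts per category.
wide <- data.frame(
  BEER  = c(1, 0, 2, 0, 1, 0),
  PIZZA = c(0, 2, 1, 0, 0, 1),
  SALTY = c(3, 1, 0, 2, 1, 0)
)

# 0/1 indicator matrix: was the category bought at all in that week?
bought <- (as.matrix(wide) > 0) * 1

# co_occurrence[i, j] = number of weeks in which both i and j were bought;
# the diagonal holds the number of weeks in which each category was bought.
co_occurrence <- crossprod(bought)

# Dividing a matrix by a vector recycles down the columns, so row i is
# divided by diag(co_occurrence)[i]:
# cond_prob[i, j] = P(j bought | i bought).
cond_prob <- co_occurrence / diag(co_occurrence)

print(round(cond_prob, 3))
```

Read the result row-wise: the row is the category you condition on and the column is the category whose probability you want. The matrix is generally not symmetric.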
An alternative, since we have a numerical variable for each of the three foods, would be to calculate correlation or even regression, but that's not what you asked for.
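For completeness, here is a rough sketch of what the correlation and regression route could look like in base R. The data are made up, and the logistic model is just one reasonable choice for a bought/not-bought outcome, not something the question prescribes. The column names `beer_any` and `salty_any` are illustrative.

```r
# Hypothetical wide data: one row per person-week, item counts per category.
wide <- data.frame(
  BEER  = c(1, 0, 2, 0, 1, 0, 0, 3),
  PIZZA = c(0, 2, 1, 0, 0, 1, 2, 0),
  SALTY = c(3, 1, 0, 2, 1, 0, 0, 2)
)

# Pairwise Pearson correlations between the weekly counts.
cor_mat <- cor(wide[, c("BEER", "PIZZA", "SALTY")])
print(round(cor_mat, 3))

# Logistic regression: does buying any BEER in a week change the odds of
# buying any SALTY in that same week?
wide$beer_any  <- as.integer(wide$BEER > 0)
wide$salty_any <- as.integer(wide$SALTY > 0)
fit <- glm(salty_any ~ beer_any, family = binomial, data = wide)

# exp(coefficient) is the odds ratio of SALTY for BEER weeks vs. non-BEER weeks.
print(exp(coef(fit)))
```

With real panel data you would probably also want to account for repeated observations per PANID (for example with a mixed model), which this sketch ignores.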
Yes I believe this is what I was hoping for. The regression will be the next step. I have made an edit to my original post regarding the understanding of the output probabilities.
– user113156
Nov 14 '18 at 16:11
Great. Don't forget to accept! Re your question - yes, your interpretation is correct. How this happens may be clearer if you look at the absolute numbers, by removing all the /n()s from the code.
– iod
Nov 14 '18 at 18:12
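To illustrate the point about absolute numbers, here is a small base R sketch on made-up data (the data frame `wide` is hypothetical). The same joint count is divided by two different totals, which is why P(SALTY | PIZZA) and P(PIZZA | SALTY) generally differ.

```r
# Hypothetical wide data: one row per person-week, item counts per category.
wide <- data.frame(
  BEER  = c(1, 0, 2, 0, 1, 0),
  PIZZA = c(0, 2, 1, 0, 0, 1),
  SALTY = c(3, 1, 0, 2, 1, 0)
)

# Absolute numbers: weeks cross-classified by whether PIZZA / SALTY was bought.
tab <- table(PIZZA = wide$PIZZA > 0, SALTY = wide$SALTY > 0)
print(tab)

# Row-wise proportions condition on PIZZA: P(SALTY | PIZZA).
p_salty_given_pizza <- prop.table(tab, margin = 1)["TRUE", "TRUE"]

# Column-wise proportions condition on SALTY: P(PIZZA | SALTY).
p_pizza_given_salty <- prop.table(tab, margin = 2)["TRUE", "TRUE"]

print(c(salty_given_pizza = p_salty_given_pizza,
        pizza_given_salty = p_pizza_given_salty))
```

Both probabilities share the same numerator (weeks with both PIZZA and SALTY) but have different denominators (weeks with PIZZA vs. weeks with SALTY).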
answered Nov 14 '18 at 14:10
iod