Complex pandas subsetting: selecting rows that match criteria in many columns

I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.

The samples are from one of 3 tissue types; let's call the tissues AE, BE and HE. The samples fit into the categories as follows:

AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]

I've written the following queries with pandas; they all work but seem very clunky:

-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE

Here is the code:

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) | 
 (df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]


"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) | 
 (df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) & 
 ((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) & 
 ((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
 (df["BE1.8"] == 1.0) | (df["BE2"] == 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
 ((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
 ((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) | 
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]

This works, but it is clunky and inelegant, and if I need to change anything (such as sample names) it is time-consuming to rewrite. Can anyone help me with a simpler way to write this query? I was wondering whether I could just pass in each list with a criterion, something like:

AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]

I find I need to subset rows based on multiple columns where the columns can be grouped quite regularly so if anyone can make this kind of query more concise I would really appreciate it. Many thanks

EDIT: Minimal example as requested:

mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]

import string
import pandas as pd

# Row labels 'a'..'j', one per mutation
m_list = list(string.ascii_lowercase[:10])

# Build the frame in one step (float dtype) instead of appending row by row
df = pd.DataFrame(mutations, index=m_list, columns=['AE1','AE2','BE1','BE2','HE1','HE2'])

AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']

"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE1"] != 0.5) | (df["BE2"] != 0.5))]

"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE1"] != 0.5) | (df["HE2"] != 0.5))]

"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE1"] != 0.5) | (df["AE2"] != 0.5))]

This is an extremely simplified minimal example of the problem. Multiple criteria are used to select subsets of the df: I would like to apply one kind of test across a whole group of columns and a different kind of test to another group of columns. However, this gets very messy once you have more than about 10 columns. A more realistic example is the first one above; as has been pointed out, it is almost unreadable, which is exactly my point. Is there a neater way to write this kind of complex query/subset, where multiple columns require the same selective operation? I would appreciate any help.

edited Nov 15 '18 at 13:23

asked Nov 15 '18 at 12:21

user3062260


I've provided a minimal example that can be copied and pasted into a Python terminal. The code being almost unreadable is exactly my point: I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset.

– user3062260
Nov 15 '18 at 13:26

Consider checking this link and also query.

– Sandeep Kadapa
Nov 15 '18 at 13:45

python pandas

1 Answer

`eq` + `any` / `all` + `loc`

You can vectorise this: select each group of columns, test for equality with `eq`, and reduce across the columns with `any` / `all`:

# Get variants in AE and in HE but not in BE

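# axis=1 reduces across the columns of each row; recent pandas versions
# require it to be passed by keyword.
m1 = df[AE].eq(1.0).any(axis=1)  # at least one AE sample has the mutation
m2 = df[HE].eq(1.0).any(axis=1)  # at least one HE sample has the mutation
m3 = df[BE].eq(0).all(axis=1)    # every BE sample is known not to have it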

df_filtered = df.loc[m1 & m2 & m3]

If, as you describe, all your values are 0, 0.5 or 1.0, saying selected values cannot be 1.0 or 0.5 is the same as requiring them to be 0.
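The same three masks cover all of the queries in the question if they are wrapped in a small function. The sketch below is only an illustration on the question's minimal example: `select_rows` and `groups` are hypothetical names, not part of pandas. It uses the strict rule from this answer, so a row is dropped if any sample in the "absent" group is 0.5.

```python
import string

import pandas as pd

# Minimal example data from the question: rows are mutations, columns are samples.
mutations = [
    [1, 1, 0, 0, 0.5, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 1, 0, 0.5, 0, 0],
    [0, 0.5, 0, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0.5, 1],
    [0, 1, 1, 1, 0, 0],
    [1, 0.5, 0, 1, 0, 0],
]
columns = ["AE1", "AE2", "BE1", "BE2", "HE1", "HE2"]
df = pd.DataFrame(mutations, index=list(string.ascii_lowercase[:10]), columns=columns)

# Column names per tissue; changing sample names only means editing this dict.
groups = {"AE": ["AE1", "AE2"], "BE": ["BE1", "BE2"], "HE": ["HE1", "HE2"]}


def select_rows(frame, present, absent):
    """Hypothetical helper: keep rows with a 1 in every `present` group
    and only 0 values in the `absent` group."""
    mask = frame[absent].eq(0).all(axis=1)
    for cols in present:
        mask &= frame[cols].eq(1).any(axis=1)
    return frame.loc[mask]


AE_HE_notBE = select_rows(df, [groups["AE"], groups["HE"]], groups["BE"])
AE_BE_notHE = select_rows(df, [groups["AE"], groups["BE"]], groups["HE"])
BE_HE_notAE = select_rows(df, [groups["BE"], groups["HE"]], groups["AE"])

print(list(AE_HE_notBE.index))  # ['b']
print(list(AE_BE_notHE.index))  # ['i', 'j']
print(list(BE_HE_notAE.index))  # ['h']
```

Row `d` has a mutation in both BE and HE but is not returned for "not AE", because one of its AE samples is 0.5 rather than 0.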

edited Nov 15 '18 at 14:15

answered Nov 15 '18 at 14:10

jpp


This is a really great and concise piece of code, exactly what I'm looking for! My only question: the 0.5 values are not a problem as long as not all columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where, for example, at least one AE col = 1, at least one HE col = 1, and no BE col = 1 but at least one BE col = 0? However many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fulfilled.

– user3062260
Nov 15 '18 at 14:43

For the BE condition you can use something like df[BE].ne(1).all(axis=1) & df[BE].eq(0).any(axis=1); ne stands for "not equal".

– jpp
Nov 15 '18 at 14:45

This is perfect! Thanks for your help!

– user3062260
Nov 16 '18 at 11:47
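The relaxed BE condition discussed in these comments (no BE sample equals 1, at least one BE sample equals 0, any number of 0.5 values) can be sketched on a tiny made-up frame. The three rows and the mask names below are assumptions for illustration only. "No column equals 1" is written as `ne(1).all(axis=1)`, since every BE column has to pass the test.

```python
import pandas as pd

# Three hypothetical rows; every row has a mutation in AE and HE,
# so only the BE columns decide whether it is kept.
df = pd.DataFrame(
    {
        "AE1": [1.0, 1.0, 1.0],
        "HE1": [1.0, 1.0, 1.0],
        "BE1": [0.0, 0.5, 1.0],
        "BE2": [0.5, 0.5, 0.0],
    },
    index=["x", "y", "z"],
)
AE, HE, BE = ["AE1"], ["HE1"], ["BE1", "BE2"]

in_AE = df[AE].eq(1).any(axis=1)             # at least one AE sample is 1
in_HE = df[HE].eq(1).any(axis=1)             # at least one HE sample is 1
no_BE_mutation = df[BE].ne(1).all(axis=1)    # no BE sample is 1
some_BE_observed = df[BE].eq(0).any(axis=1)  # at least one BE sample is 0

AE_HE_notBE = df.loc[in_AE & in_HE & no_BE_mutation & some_BE_observed]
print(list(AE_HE_notBE.index))  # ['x']
```

Row `y` is dropped because its BE samples are all 0.5 (no data), and row `z` because one BE sample carries the mutation.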
