complex pandas sub setting; selecting rows that match criteria in many columns
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.
The samples are from one of 3 tissue types, lets call the tissues AE, BE and HE. The samples fit into the categories as follows:
AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]
I've made the following queries with pandas, they all work but seem very clunky:
-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE
Here is the code:
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) &
((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) &
((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) |
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]
This works fine but seems incredibly clunky and just not very elegant and if I need to change a few things (such as sample names) it is time consuming to re-write, can anyone help me with a simple way to re-write this query? I was wondering if there is a way where I could just pass in each list with a criteria? Something like:
AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]
I find I need to subset rows based on multiple columns where the columns can be grouped quite regularly so if anyone can make this kind of query more concise I would really appreciate it. Many thanks
#
EDIT: Minimal example as requested:
mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]
import string
import pandas as pd
m_list=[x for x in string.ascii_lowercase[:10]]
df=pd.DataFrame(columns=['AE1','AE2','BE1','BE2','HE1','HE2'])
for m,n in zip(m_list, mutations):
df.loc[m]=n
AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE2"] != 0.5) | (df["BE2"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE2"] != 0.5) | (df["HE2"] != 0.5))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE2"] != 0.5) | (df["AE2"] != 0.5))]
This shows an extremely simplified minimal example of the problem. Multiple criteria are being used to select subsets of the df, I would like to perform a type of subset across a whole group of columns and a different type of subset on another group of columns, however this gets very messy when you have more than say 10 columns. A more realistic example is shown in the 1st example, as has been pointed out this is almost unreadable - which is exactly my point - is there a neater way to write this kind of complex query/subset where multiple columns require the same selective operation to be performed on them? I would appreciate any help.
python pandas
add a comment |
I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.
The samples are from one of 3 tissue types, lets call the tissues AE, BE and HE. The samples fit into the categories as follows:
AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]
I've made the following queries with pandas, they all work but seem very clunky:
-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE
Here is the code:
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) &
((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) &
((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) |
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]
This works fine but seems incredibly clunky and just not very elegant and if I need to change a few things (such as sample names) it is time consuming to re-write, can anyone help me with a simple way to re-write this query? I was wondering if there is a way where I could just pass in each list with a criteria? Something like:
AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]
I find I need to subset rows based on multiple columns where the columns can be grouped quite regularly so if anyone can make this kind of query more concise I would really appreciate it. Many thanks
#
EDIT: Minimal example as requested:
mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]
import string
import pandas as pd
m_list=[x for x in string.ascii_lowercase[:10]]
df=pd.DataFrame(columns=['AE1','AE2','BE1','BE2','HE1','HE2'])
for m,n in zip(m_list, mutations):
df.loc[m]=n
AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE2"] != 0.5) | (df["BE2"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE2"] != 0.5) | (df["HE2"] != 0.5))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE2"] != 0.5) | (df["AE2"] != 0.5))]
This shows an extremely simplified minimal example of the problem. Multiple criteria are being used to select subsets of the df, I would like to perform a type of subset across a whole group of columns and a different type of subset on another group of columns, however this gets very messy when you have more than say 10 columns. A more realistic example is shown in the 1st example, as has been pointed out this is almost unreadable - which is exactly my point - is there a neater way to write this kind of complex query/subset where multiple columns require the same selective operation to be performed on them? I would appreciate any help.
python pandas
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
Consider checking thislink
and alsoquery
.
– Sandeep Kadapa
Nov 15 '18 at 13:45
add a comment |
I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.
The samples are from one of 3 tissue types, lets call the tissues AE, BE and HE. The samples fit into the categories as follows:
AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]
I've made the following queries with pandas, they all work but seem very clunky:
-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE
Here is the code:
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) &
((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) &
((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) |
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]
This works fine but seems incredibly clunky and just not very elegant and if I need to change a few things (such as sample names) it is time consuming to re-write, can anyone help me with a simple way to re-write this query? I was wondering if there is a way where I could just pass in each list with a criteria? Something like:
AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]
I find I need to subset rows based on multiple columns where the columns can be grouped quite regularly so if anyone can make this kind of query more concise I would really appreciate it. Many thanks
#
EDIT: Minimal example as requested:
mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]
import string
import pandas as pd
m_list=[x for x in string.ascii_lowercase[:10]]
df=pd.DataFrame(columns=['AE1','AE2','BE1','BE2','HE1','HE2'])
for m,n in zip(m_list, mutations):
df.loc[m]=n
AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE2"] != 0.5) | (df["BE2"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE2"] != 0.5) | (df["HE2"] != 0.5))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE2"] != 0.5) | (df["AE2"] != 0.5))]
This shows an extremely simplified minimal example of the problem. Multiple criteria are being used to select subsets of the df, I would like to perform a type of subset across a whole group of columns and a different type of subset on another group of columns, however this gets very messy when you have more than say 10 columns. A more realistic example is shown in the 1st example, as has been pointed out this is almost unreadable - which is exactly my point - is there a neater way to write this kind of complex query/subset where multiple columns require the same selective operation to be performed on them? I would appreciate any help.
python pandas
I'm selecting data from a pandas dataframe which is ~1.5M rows by 22 columns. Each column is a sample and each row is an observation about a mutation. 1.0 means the sample has a mutation, 0.0 means the sample does not have the specific mutation and 0.5 means there is no data on that sample for that mutation.
The samples are from one of 3 tissue types, lets call the tissues AE, BE and HE. The samples fit into the categories as follows:
AE=["X14AE","X15AE","X22AE","X23AE","AE21.35","AE36.45","AE46.55","AE61.80",]
BE=["X161724BE","BE1.2","BE1.8","BE2","BE9.13"]
HE=["X11HE","X18HE","HE17.24","HE2.4.5.6","HE8.15","HE8.9"]
I've made the following queries with pandas, they all work but seem very clunky:
-Get variants in AE and in HE but not in BE
-Get variants in AE and in BE but not in HE
-Get variants in BE and in HE but not in AE
Here is the code:
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) & ((df["X161724BE"] != 1.0) & (df["BE1.2"] != 1.0) &
(df["BE1.8"] != 1.0) & (df["BE2"] != 1.0) & (df["BE9.13"] != 1.0)) & ((df["X161724BE"] != 0.5) | (df["BE1.2"] != 0.5) |
(df["BE1.8"] != 0.5) | (df["BE2"] != 0.5) | (df["BE9.13"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["X14AE"] == 1.0) | (df["X15AE"] == 1.0) | (df["X22AE"] == 1.0) |
(df["X23AE"] == 1.0) | (df["AE21.35"] == 1.0) | (df["AE36.45"] == 1.0) | (df["AE61.80"] == 1.0)) &
((df["X11HE"] != 1.0) & (df["X18HE"] != 1.0) &(df["HE17.24"] != 1.0) & (df["HE2.4.5.6"] != 1.0) &
(df["HE8.15"] != 1.0) & (df["HE8.9"] != 1.0)) &
((df["X11HE"] != 0.5) | (df["X18HE"] != 0.5) |(df["HE17.24"] != 0.5) |(df["HE2.4.5.6"] != 0.5) |
(df["HE8.15"] != 0.5) | (df["HE8.9"] != 0.5)) &
((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["X161724BE"] == 1.0) | (df["BE1.2"] == 1.0) |
(df["BE1.8"] == 1.0) | (df["BE2"] != 1.0) | (df["BE9.13"] == 1.0)) &
((df["X11HE"] == 1.0) | (df["X18HE"] == 1.0) |(df["HE17.24"] == 1.0) |(df["HE2.4.5.6"] == 1.0) |
(df["HE8.15"] == 1.0) | (df["HE8.9"] == 1.0)) &
((df["X14AE"] != 1.0) & (df["X15AE"] != 1.0) & (df["X22AE"] != 1.0) &
(df["X23AE"] != 1.0) & (df["AE21.35"] != 1.0) & (df["AE36.45"] != 1.0) & (df["AE61.80"] != 1.0)) &
((df["X14AE"] != 0.5) | (df["X15AE"] != 0.5) | (df["X22AE"] != 0.5) |
(df["X23AE"] != 0.5) | (df["AE21.35"] != 0.5) | (df["AE36.45"] != 0.5) | (df["AE61.80"] != 0.5))]
This works fine but seems incredibly clunky and just not very elegant and if I need to change a few things (such as sample names) it is time consuming to re-write, can anyone help me with a simple way to re-write this query? I was wondering if there is a way where I could just pass in each list with a criteria? Something like:
AE_HE_notBE = df.loc[((df.[at least 1 sample from AE_list] == 1.0) & (df.[at least 1 sample from HE_list] == 1.0) & (df.[no sample from BE_list] == 1.0) & (df.[at least 1 sample from BE_list] == 0.0))]
I find I need to subset rows based on multiple columns where the columns can be grouped quite regularly so if anyone can make this kind of query more concise I would really appreciate it. Many thanks
#
EDIT: Minimal example as requested:
mutations=[[1,1,0,0,0.5,0],
[1,0,0,0,1,0],
[1,1,0,0.5,0,0],
[0,0.5,0,1,0,1],
[0,1,0,0,0,0],
[1,0,0,0,0,0],
[1,0,1,0,1,0],
[0,0,0,1,0.5,1],
[0,1,1,1,0,0],
[1,0.5,0,1,0,0]]
import string
import pandas as pd
m_list=[x for x in string.ascii_lowercase[:10]]
df=pd.DataFrame(columns=['AE1','AE2','BE1','BE2','HE1','HE2'])
for m,n in zip(m_list, mutations):
df.loc[m]=n
AE=['AE1','AE2']
BE=['BE1','BE2']
HE=['HE1','HE2']
"""Get variants in AE and in HE but not in BE"""
AE_HE_notBE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["BE1"] != 1.0) & (df["BE2"] != 1.0)) & ((df["BE2"] != 0.5) | (df["BE2"] != 0.5))]
"""Get variants in AE and in BE but not in HE"""
AE_BE_notHE = df.loc[((df["AE1"] == 1.0) | (df["AE2"] == 1.0)) & ((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] != 1.0) & (df["HE2"] != 1.0)) & ((df["HE2"] != 0.5) | (df["HE2"] != 0.5))]
"""Get variants in BE and in HE but not in AE"""
BE_HE_notAE = df.loc[((df["BE1"] == 1.0) | (df["BE2"] == 1.0)) & ((df["HE1"] == 1.0) | (df["HE2"] == 1.0)) & ((df["AE1"] != 1.0) & (df["AE2"] != 1.0)) & ((df["AE2"] != 0.5) | (df["AE2"] != 0.5))]
This shows an extremely simplified minimal example of the problem. Multiple criteria are being used to select subsets of the df, I would like to perform a type of subset across a whole group of columns and a different type of subset on another group of columns, however this gets very messy when you have more than say 10 columns. A more realistic example is shown in the 1st example, as has been pointed out this is almost unreadable - which is exactly my point - is there a neater way to write this kind of complex query/subset where multiple columns require the same selective operation to be performed on them? I would appreciate any help.
python pandas
python pandas
edited Nov 15 '18 at 13:23
user3062260
asked Nov 15 '18 at 12:21
user3062260user3062260
435522
435522
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
Consider checking thislink
and alsoquery
.
– Sandeep Kadapa
Nov 15 '18 at 13:45
add a comment |
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
Consider checking thislink
and alsoquery
.
– Sandeep Kadapa
Nov 15 '18 at 13:45
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
Consider checking this
link
and also query
.– Sandeep Kadapa
Nov 15 '18 at 13:45
Consider checking this
link
and also query
.– Sandeep Kadapa
Nov 15 '18 at 13:45
add a comment |
1 Answer
1
active
oldest
votes
eq
+ any
/ all
+ loc
Vectorised, you can subset your dataframe and use equality and any
/ all
operations:
# Get variants in AE and in HE but not in BE
m1 = df[AE].eq(1.0).any(1)
m2 = df[HE].eq(1.0).any(1)
m3 = df[BE].eq(0).all(1)
df_filtered = df.loc[m1 & m2 & m3]
If, as you describe, all your values are 0
, 0.5
or 1.0
, saying selected values cannot be 1.0
or 0.5
is the same as requiring them to be 0
.
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something likedf[BE].ne(1).any(1) & df[BE].eq(0).any(1)
;ne
stands for "not equal".
– jpp
Nov 15 '18 at 14:45
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53319422%2fcomplex-pandas-sub-setting-selecting-rows-that-match-criteria-in-many-columns%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
eq
+ any
/ all
+ loc
Vectorised, you can subset your dataframe and use equality and any
/ all
operations:
# Get variants in AE and in HE but not in BE
m1 = df[AE].eq(1.0).any(1)
m2 = df[HE].eq(1.0).any(1)
m3 = df[BE].eq(0).all(1)
df_filtered = df.loc[m1 & m2 & m3]
If, as you describe, all your values are 0
, 0.5
or 1.0
, saying selected values cannot be 1.0
or 0.5
is the same as requiring them to be 0
.
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something likedf[BE].ne(1).any(1) & df[BE].eq(0).any(1)
;ne
stands for "not equal".
– jpp
Nov 15 '18 at 14:45
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
add a comment |
eq
+ any
/ all
+ loc
Vectorised, you can subset your dataframe and use equality and any
/ all
operations:
# Get variants in AE and in HE but not in BE
m1 = df[AE].eq(1.0).any(1)
m2 = df[HE].eq(1.0).any(1)
m3 = df[BE].eq(0).all(1)
df_filtered = df.loc[m1 & m2 & m3]
If, as you describe, all your values are 0
, 0.5
or 1.0
, saying selected values cannot be 1.0
or 0.5
is the same as requiring them to be 0
.
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something likedf[BE].ne(1).any(1) & df[BE].eq(0).any(1)
;ne
stands for "not equal".
– jpp
Nov 15 '18 at 14:45
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
add a comment |
eq
+ any
/ all
+ loc
Vectorised, you can subset your dataframe and use equality and any
/ all
operations:
# Get variants in AE and in HE but not in BE
m1 = df[AE].eq(1.0).any(1)
m2 = df[HE].eq(1.0).any(1)
m3 = df[BE].eq(0).all(1)
df_filtered = df.loc[m1 & m2 & m3]
If, as you describe, all your values are 0
, 0.5
or 1.0
, saying selected values cannot be 1.0
or 0.5
is the same as requiring them to be 0
.
eq
+ any
/ all
+ loc
Vectorised, you can subset your dataframe and use equality and any
/ all
operations:
# Get variants in AE and in HE but not in BE
m1 = df[AE].eq(1.0).any(1)
m2 = df[HE].eq(1.0).any(1)
m3 = df[BE].eq(0).all(1)
df_filtered = df.loc[m1 & m2 & m3]
If, as you describe, all your values are 0
, 0.5
or 1.0
, saying selected values cannot be 1.0
or 0.5
is the same as requiring them to be 0
.
edited Nov 15 '18 at 14:15
answered Nov 15 '18 at 14:10
jppjpp
103k2167117
103k2167117
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something likedf[BE].ne(1).any(1) & df[BE].eq(0).any(1)
;ne
stands for "not equal".
– jpp
Nov 15 '18 at 14:45
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
add a comment |
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something likedf[BE].ne(1).any(1) & df[BE].eq(0).any(1)
;ne
stands for "not equal".
– jpp
Nov 15 '18 at 14:45
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
This is a really great and concise piece of code - exactly what I'm looking for! My only question is: given that the 0.5 values are not a problem, as long as 'not all' columns in a given group such as 'AE' are 0.5. How can I use your code to build a subset where for example: in AE at least one col=1, in HE at least 1 col=1 and in BE no col can =1, but 1 BE must =0, however many cols are 0.5 doesn't matter as long as the right allocation of 1 and 0 is fullfilled.
– user3062260
Nov 15 '18 at 14:43
for the BE condition you can use something like
df[BE].ne(1).any(1) & df[BE].eq(0).any(1)
; ne
stands for "not equal".– jpp
Nov 15 '18 at 14:45
for the BE condition you can use something like
df[BE].ne(1).any(1) & df[BE].eq(0).any(1)
; ne
stands for "not equal".– jpp
Nov 15 '18 at 14:45
1
1
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
This is perfect! Thanks for your help!
– user3062260
Nov 16 '18 at 11:47
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53319422%2fcomplex-pandas-sub-setting-selecting-rows-that-match-criteria-in-many-columns%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
I've provided a minimal example that can be copied and pasted into a python terminal. The code being almost non readable is exactly my point - I'm hopeful that someone else has come across this problem before and knows a more effective and readable way to write this kind of complex subset
– user3062260
Nov 15 '18 at 13:26
Consider checking this
link
and alsoquery
.– Sandeep Kadapa
Nov 15 '18 at 13:45