Selecting rows of a Pandas dataframe based on a dict
I have a pandas dataframe that only have two columns like this
Timestamp X
0 2017-01-01 00:00:00 18450
1 2017-01-01 00:10:00 13787
2 2017-01-01 00:20:00 3249
3 2017-01-01 00:30:00 44354
4 2017-01-01 00:40:00 50750
The Timestamp column is basically 10 minute separated ranging from start of the month to end. To create a sample the following code can be used.
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 00:00:00'), end=pd.Timestamp('2017-01-20 00:00:00'), freq='10T')
l_data['X'] = random.sample(range(0, 100000), len(l_data))
I have a dictionary with me like this
Timestamp('2017-01-18 01:37:19.160000'): Timestamp('2017-01-18 01:37:29.520000'),
Timestamp('2017-01-18 01:41:04.880000'): Timestamp('2017-01-18 01:41:10.280000'),
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000'),
Timestamp('2017-01-18 21:40:02.120000'): Timestamp('2017-01-18 21:50:00.040000'),
Timestamp('2017-01-18 21:50:02.120000'): Timestamp('2017-01-18 22:00:00.040000'),
Timestamp('2017-01-18 22:00:02.120000'): Timestamp('2017-01-18 22:01:50.760000'),
Timestamp('2017-01-18 22:20:22.760000'): Timestamp('2017-01-18 22:25:20.760000'),
Timestamp('2017-01-18 22:35:52.800000'): Timestamp('2017-01-18 22:40:00.040000')
The key in the dictionary is starting time and value is the ending time. I want to create a column named L
based on this dict
in l_data
If the time between key and value in dict
is greater than 5 minute I have to mark the Timestamp that falls in that range in l_data
as 1.
How to implement this in pandas in a straight forward way rather than using multiple loops.?
Expected output will look like this
126 1/18/2017 21:00 43401 0
127 1/18/2017 21:10 290 0
128 1/18/2017 21:20 92509 0
129 1/18/2017 21:30 64545 0
130 1/18/2017 21:40 47780 1
131 1/18/2017 21:50 53293 1
132 1/18/2017 22:00 45634 0
133 1/18/2017 22:10 51462 0
134 1/18/2017 22:20 44736 0
135 1/18/2017 22:30 11697 1
136 1/18/2017 22:40 82587 1
137 1/18/2017 22:50 76250 0
138 1/18/2017 23:00 33307 0
139 1/18/2017 23:10 25851 0
140 1/18/2017 23:20 71131 0
141 1/18/2017 23:30 88015 0
142 1/18/2017 23:40 45577 0
143 1/18/2017 23:50 76761 0
144 1/19/2017 0:00 45363 0
Only significant rows are shown
python pandas
|
show 3 more comments
I have a pandas dataframe that only have two columns like this
Timestamp X
0 2017-01-01 00:00:00 18450
1 2017-01-01 00:10:00 13787
2 2017-01-01 00:20:00 3249
3 2017-01-01 00:30:00 44354
4 2017-01-01 00:40:00 50750
The Timestamp column is basically 10 minute separated ranging from start of the month to end. To create a sample the following code can be used.
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 00:00:00'), end=pd.Timestamp('2017-01-20 00:00:00'), freq='10T')
l_data['X'] = random.sample(range(0, 100000), len(l_data))
I have a dictionary with me like this
Timestamp('2017-01-18 01:37:19.160000'): Timestamp('2017-01-18 01:37:29.520000'),
Timestamp('2017-01-18 01:41:04.880000'): Timestamp('2017-01-18 01:41:10.280000'),
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000'),
Timestamp('2017-01-18 21:40:02.120000'): Timestamp('2017-01-18 21:50:00.040000'),
Timestamp('2017-01-18 21:50:02.120000'): Timestamp('2017-01-18 22:00:00.040000'),
Timestamp('2017-01-18 22:00:02.120000'): Timestamp('2017-01-18 22:01:50.760000'),
Timestamp('2017-01-18 22:20:22.760000'): Timestamp('2017-01-18 22:25:20.760000'),
Timestamp('2017-01-18 22:35:52.800000'): Timestamp('2017-01-18 22:40:00.040000')
The key in the dictionary is starting time and value is the ending time. I want to create a column named L
based on this dict
in l_data
If the time between key and value in dict
is greater than 5 minute I have to mark the Timestamp that falls in that range in l_data
as 1.
How to implement this in pandas in a straight forward way rather than using multiple loops.?
Expected output will look like this
126 1/18/2017 21:00 43401 0
127 1/18/2017 21:10 290 0
128 1/18/2017 21:20 92509 0
129 1/18/2017 21:30 64545 0
130 1/18/2017 21:40 47780 1
131 1/18/2017 21:50 53293 1
132 1/18/2017 22:00 45634 0
133 1/18/2017 22:10 51462 0
134 1/18/2017 22:20 44736 0
135 1/18/2017 22:30 11697 1
136 1/18/2017 22:40 82587 1
137 1/18/2017 22:50 76250 0
138 1/18/2017 23:00 33307 0
139 1/18/2017 23:10 25851 0
140 1/18/2017 23:20 71131 0
141 1/18/2017 23:30 88015 0
142 1/18/2017 23:40 45577 0
143 1/18/2017 23:50 76761 0
144 1/19/2017 0:00 45363 0
Only significant rows are shown
python pandas
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
1
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31
|
show 3 more comments
I have a pandas dataframe that only have two columns like this
Timestamp X
0 2017-01-01 00:00:00 18450
1 2017-01-01 00:10:00 13787
2 2017-01-01 00:20:00 3249
3 2017-01-01 00:30:00 44354
4 2017-01-01 00:40:00 50750
The Timestamp column is basically 10 minute separated ranging from start of the month to end. To create a sample the following code can be used.
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 00:00:00'), end=pd.Timestamp('2017-01-20 00:00:00'), freq='10T')
l_data['X'] = random.sample(range(0, 100000), len(l_data))
I have a dictionary with me like this
Timestamp('2017-01-18 01:37:19.160000'): Timestamp('2017-01-18 01:37:29.520000'),
Timestamp('2017-01-18 01:41:04.880000'): Timestamp('2017-01-18 01:41:10.280000'),
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000'),
Timestamp('2017-01-18 21:40:02.120000'): Timestamp('2017-01-18 21:50:00.040000'),
Timestamp('2017-01-18 21:50:02.120000'): Timestamp('2017-01-18 22:00:00.040000'),
Timestamp('2017-01-18 22:00:02.120000'): Timestamp('2017-01-18 22:01:50.760000'),
Timestamp('2017-01-18 22:20:22.760000'): Timestamp('2017-01-18 22:25:20.760000'),
Timestamp('2017-01-18 22:35:52.800000'): Timestamp('2017-01-18 22:40:00.040000')
The key in the dictionary is starting time and value is the ending time. I want to create a column named L
based on this dict
in l_data
If the time between key and value in dict
is greater than 5 minute I have to mark the Timestamp that falls in that range in l_data
as 1.
How to implement this in pandas in a straight forward way rather than using multiple loops.?
Expected output will look like this
126 1/18/2017 21:00 43401 0
127 1/18/2017 21:10 290 0
128 1/18/2017 21:20 92509 0
129 1/18/2017 21:30 64545 0
130 1/18/2017 21:40 47780 1
131 1/18/2017 21:50 53293 1
132 1/18/2017 22:00 45634 0
133 1/18/2017 22:10 51462 0
134 1/18/2017 22:20 44736 0
135 1/18/2017 22:30 11697 1
136 1/18/2017 22:40 82587 1
137 1/18/2017 22:50 76250 0
138 1/18/2017 23:00 33307 0
139 1/18/2017 23:10 25851 0
140 1/18/2017 23:20 71131 0
141 1/18/2017 23:30 88015 0
142 1/18/2017 23:40 45577 0
143 1/18/2017 23:50 76761 0
144 1/19/2017 0:00 45363 0
Only significant rows are shown
python pandas
I have a pandas dataframe that only have two columns like this
Timestamp X
0 2017-01-01 00:00:00 18450
1 2017-01-01 00:10:00 13787
2 2017-01-01 00:20:00 3249
3 2017-01-01 00:30:00 44354
4 2017-01-01 00:40:00 50750
The Timestamp column is basically 10 minute separated ranging from start of the month to end. To create a sample the following code can be used.
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 00:00:00'), end=pd.Timestamp('2017-01-20 00:00:00'), freq='10T')
l_data['X'] = random.sample(range(0, 100000), len(l_data))
I have a dictionary with me like this
Timestamp('2017-01-18 01:37:19.160000'): Timestamp('2017-01-18 01:37:29.520000'),
Timestamp('2017-01-18 01:41:04.880000'): Timestamp('2017-01-18 01:41:10.280000'),
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000'),
Timestamp('2017-01-18 21:40:02.120000'): Timestamp('2017-01-18 21:50:00.040000'),
Timestamp('2017-01-18 21:50:02.120000'): Timestamp('2017-01-18 22:00:00.040000'),
Timestamp('2017-01-18 22:00:02.120000'): Timestamp('2017-01-18 22:01:50.760000'),
Timestamp('2017-01-18 22:20:22.760000'): Timestamp('2017-01-18 22:25:20.760000'),
Timestamp('2017-01-18 22:35:52.800000'): Timestamp('2017-01-18 22:40:00.040000')
The key in the dictionary is starting time and value is the ending time. I want to create a column named L
based on this dict
in l_data
If the time between key and value in dict
is greater than 5 minute I have to mark the Timestamp that falls in that range in l_data
as 1.
How to implement this in pandas in a straight forward way rather than using multiple loops.?
Expected output will look like this
126 1/18/2017 21:00 43401 0
127 1/18/2017 21:10 290 0
128 1/18/2017 21:20 92509 0
129 1/18/2017 21:30 64545 0
130 1/18/2017 21:40 47780 1
131 1/18/2017 21:50 53293 1
132 1/18/2017 22:00 45634 0
133 1/18/2017 22:10 51462 0
134 1/18/2017 22:20 44736 0
135 1/18/2017 22:30 11697 1
136 1/18/2017 22:40 82587 1
137 1/18/2017 22:50 76250 0
138 1/18/2017 23:00 33307 0
139 1/18/2017 23:10 25851 0
140 1/18/2017 23:20 71131 0
141 1/18/2017 23:30 88015 0
142 1/18/2017 23:40 45577 0
143 1/18/2017 23:50 76761 0
144 1/19/2017 0:00 45363 0
Only significant rows are shown
python pandas
python pandas
edited Nov 15 '18 at 6:49
Sreeram TP
asked Nov 15 '18 at 6:22
Sreeram TPSreeram TP
3,32431441
3,32431441
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
1
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31
|
show 3 more comments
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
1
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
1
1
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31
|
show 3 more comments
1 Answer
1
active
oldest
votes
I believe you need:
d = pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
end=pd.Timestamp('2017-01-19 04:00:00'), freq='10T')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame('start': list(d.keys()),'end': list(d.values()))
#fikter by 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')]
#correct 1 minutes end time
s = df['end'].dt.floor('10T')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print (df)
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
#for each group resample by 10min and add missimg datetimes
v = (df.reset_index()[['start','end1','index']]
.melt('index')
.set_index('value')
.groupby('index')
.resample('10T')['index']
.ffill()
.dropna()
.index
.get_level_values(1)
.unique()
)
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change<
to>
in conditions.
– jezrael
Nov 15 '18 at 7:10
SinceTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestampTimestamp('2017-01-18 21:40:00.040000')
marked as 1
– Sreeram TP
Nov 15 '18 at 7:11
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time inl_data
that corresponds to value. There may be samples like thisTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is pastTimestamp('2017-01-18 21:40:00.040000')
then markTimestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
|
show 15 more comments
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53313565%2fselecting-rows-of-a-pandas-dataframe-based-on-a-dict%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
I believe you need:
d = pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
end=pd.Timestamp('2017-01-19 04:00:00'), freq='10T')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame('start': list(d.keys()),'end': list(d.values()))
#fikter by 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')]
#correct 1 minutes end time
s = df['end'].dt.floor('10T')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print (df)
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
#for each group resample by 10min and add missimg datetimes
v = (df.reset_index()[['start','end1','index']]
.melt('index')
.set_index('value')
.groupby('index')
.resample('10T')['index']
.ffill()
.dropna()
.index
.get_level_values(1)
.unique()
)
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change<
to>
in conditions.
– jezrael
Nov 15 '18 at 7:10
SinceTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestampTimestamp('2017-01-18 21:40:00.040000')
marked as 1
– Sreeram TP
Nov 15 '18 at 7:11
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time inl_data
that corresponds to value. There may be samples like thisTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is pastTimestamp('2017-01-18 21:40:00.040000')
then markTimestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
|
show 15 more comments
I believe you need:
d = pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
end=pd.Timestamp('2017-01-19 04:00:00'), freq='10T')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame('start': list(d.keys()),'end': list(d.values()))
#fikter by 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')]
#correct 1 minutes end time
s = df['end'].dt.floor('10T')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print (df)
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
#for each group resample by 10min and add missimg datetimes
v = (df.reset_index()[['start','end1','index']]
.melt('index')
.set_index('value')
.groupby('index')
.resample('10T')['index']
.ffill()
.dropna()
.index
.get_level_values(1)
.unique()
)
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change<
to>
in conditions.
– jezrael
Nov 15 '18 at 7:10
SinceTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestampTimestamp('2017-01-18 21:40:00.040000')
marked as 1
– Sreeram TP
Nov 15 '18 at 7:11
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time inl_data
that corresponds to value. There may be samples like thisTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is pastTimestamp('2017-01-18 21:40:00.040000')
then markTimestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
|
show 15 more comments
I believe you need:
d = pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
end=pd.Timestamp('2017-01-19 04:00:00'), freq='10T')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame('start': list(d.keys()),'end': list(d.values()))
#fikter by 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')]
#correct 1 minutes end time
s = df['end'].dt.floor('10T')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print (df)
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
#for each group resample by 10min and add missimg datetimes
v = (df.reset_index()[['start','end1','index']]
.melt('index')
.set_index('value')
.groupby('index')
.resample('10T')['index']
.ffill()
.dropna()
.index
.get_level_values(1)
.unique()
)
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
I believe you need:
d = pd.Timestamp('2017-01-18 21:45:02.120000'): pd.Timestamp('2017-01-18 21:50:29.040000'),
pd.Timestamp('2017-01-18 21:51:02.120000'): pd.Timestamp('2017-01-18 22:52:00.040000'),
pd.Timestamp('2017-01-18 22:52:02.120000'): pd.Timestamp('2017-01-18 22:57:59.760000'),
pd.Timestamp('2017-01-18 23:41:52.800000'): pd.Timestamp('2017-01-18 23:43:00.040000'),
pd.Timestamp('2017-01-18 23:44:52.800000'): pd.Timestamp('2017-01-18 23:50:30.040000'),
pd.Timestamp('2017-01-19 01:10:32.800000'): pd.Timestamp('2017-01-19 01:11:30.040000'),
pd.Timestamp('2017-01-19 01:40:32.800000'): pd.Timestamp('2017-01-19 01:55:30.040000'),
pd.Timestamp('2017-01-19 01:57:32.800000'): pd.Timestamp('2017-01-19 02:04:30.040000')
l_data = pd.DataFrame()
l_data['Timestamp'] = pd.date_range(start=pd.Timestamp('2017-01-18 20:00:00'),
end=pd.Timestamp('2017-01-19 04:00:00'), freq='10T')
l_data['expected'] = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
#print (l_data)
df = pd.DataFrame('start': list(d.keys()),'end': list(d.values()))
#fikter by 5 minutes
df = df[(df['end'] - df['start']) > pd.Timedelta(5*60, 's')]
#correct 1 minutes end time
s = df['end'].dt.floor('10T')
df['end1'] = s.where((df['end'] - s) < pd.Timedelta(60, 's'), s + pd.Timedelta(10*60, 's'))
print (df)
start end end1
0 2017-01-18 21:45:02.120 2017-01-18 21:50:29.040 2017-01-18 21:50:00
1 2017-01-18 21:51:02.120 2017-01-18 22:52:00.040 2017-01-18 23:00:00
2 2017-01-18 22:52:02.120 2017-01-18 22:57:59.760 2017-01-18 23:00:00
4 2017-01-18 23:44:52.800 2017-01-18 23:50:30.040 2017-01-18 23:50:00
6 2017-01-19 01:40:32.800 2017-01-19 01:55:30.040 2017-01-19 02:00:00
7 2017-01-19 01:57:32.800 2017-01-19 02:04:30.040 2017-01-19 02:10:00
#for each group resample by 10min and add missimg datetimes
v = (df.reset_index()[['start','end1','index']]
.melt('index')
.set_index('value')
.groupby('index')
.resample('10T')['index']
.ffill()
.dropna()
.index
.get_level_values(1)
.unique()
)
#print (v)
l_data['L'] = l_data['Timestamp'].isin(v).astype(int)
print (l_data.head(20))
Timestamp expected L
0 2017-01-18 20:00:00 0 0
1 2017-01-18 20:10:00 0 0
2 2017-01-18 20:20:00 0 0
3 2017-01-18 20:30:00 0 0
4 2017-01-18 20:40:00 0 0
5 2017-01-18 20:50:00 0 0
6 2017-01-18 21:00:00 0 0
7 2017-01-18 21:10:00 0 0
8 2017-01-18 21:20:00 0 0
9 2017-01-18 21:30:00 0 0
10 2017-01-18 21:40:00 0 0
11 2017-01-18 21:50:00 1 1
12 2017-01-18 22:00:00 1 1
13 2017-01-18 22:10:00 1 1
14 2017-01-18 22:20:00 1 1
15 2017-01-18 22:30:00 1 1
16 2017-01-18 22:40:00 1 1
17 2017-01-18 22:50:00 1 1
18 2017-01-18 23:00:00 1 1
19 2017-01-18 23:10:00 0 0
edited Nov 15 '18 at 14:06
answered Nov 15 '18 at 7:00
jezraeljezrael
354k26318394
354k26318394
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change<
to>
in conditions.
– jezrael
Nov 15 '18 at 7:10
SinceTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestampTimestamp('2017-01-18 21:40:00.040000')
marked as 1
– Sreeram TP
Nov 15 '18 at 7:11
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time inl_data
that corresponds to value. There may be samples like thisTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is pastTimestamp('2017-01-18 21:40:00.040000')
then markTimestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
|
show 15 more comments
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change<
to>
in conditions.
– jezrael
Nov 15 '18 at 7:10
SinceTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestampTimestamp('2017-01-18 21:40:00.040000')
marked as 1
– Sreeram TP
Nov 15 '18 at 7:11
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time inl_data
that corresponds to value. There may be samples like thisTimestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is pastTimestamp('2017-01-18 21:40:00.040000')
then markTimestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
1
1
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
Give me some time to verify the result
– Sreeram TP
Nov 15 '18 at 7:09
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change
<
to >
in conditions.– jezrael
Nov 15 '18 at 7:10
@SreeramTP - sure, not sure about condition, noew i have less like 5 minutes, maybe need more - then change
<
to >
in conditions.– jezrael
Nov 15 '18 at 7:10
Since
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestamp Timestamp('2017-01-18 21:40:00.040000')
marked as 1– Sreeram TP
Nov 15 '18 at 7:11
Since
Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:40:00.040000')
is more than 5 minutes i want the timestamp Timestamp('2017-01-18 21:40:00.040000')
marked as 1– Sreeram TP
Nov 15 '18 at 7:11
1
1
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
oops, so not understand it. give me some time.
– jezrael
Nov 15 '18 at 7:12
The conditions are if the value - key is more than 5 mints mark the time in
l_data
that corresponds to value. There may be samples like this Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is past Timestamp('2017-01-18 21:40:00.040000')
then mark Timestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
The conditions are if the value - key is more than 5 mints mark the time in
l_data
that corresponds to value. There may be samples like this Timestamp('2017-01-18 21:33:52.800000'): Timestamp('2017-01-18 21:43:00.040000')
. The difference is more than 5 mints but value is past Timestamp('2017-01-18 21:40:00.040000')
then mark Timestamp('2017-01-18 21:50:00.040000')
– Sreeram TP
Nov 15 '18 at 7:16
|
show 15 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53313565%2fselecting-rows-of-a-pandas-dataframe-based-on-a-dict%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Can you please make an example with 5 rows instead of 5000? Then you could post the expected output and I could verify my answer before posting it.
– timgeb
Nov 15 '18 at 6:25
5 rows wont give the expected distribution of start and end times. That's why I went for 5000
– Sreeram TP
Nov 15 '18 at 6:26
How does this matter for the algorithm?
– timgeb
Nov 15 '18 at 6:26
Sometimes start and end will lie within 10 mints range some times it will be between two timestamps.
– Sreeram TP
Nov 15 '18 at 6:28
1
reduced the number of rows
– Sreeram TP
Nov 15 '18 at 6:31