skip lines from csv file if it contains specific keyword in Pyspark
I have a CSV file with the details as shown below:
emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000
I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:
+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+
I can load this file into dataframe and can filter using below expression for each column
newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))
I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.
python-3.x csv pyspark
add a comment |
I have a CSV file with the details as shown below:
emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000
I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:
+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+
I can load this file into dataframe and can filter using below expression for each column
newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))
I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.
python-3.x csv pyspark
1
Have you tried to read it tordd
andfilter
the result usinglambda
function before creating dataframe? Something like this:rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00
add a comment |
I have a CSV file with the details as shown below:
emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000
I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:
+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+
I can load this file into dataframe and can filter using below expression for each column
newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))
I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.
python-3.x csv pyspark
I have a CSV file with the details as shown below:
emp_id,emp_name,emp_city,emp_salary
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
9,ABC ,ROBERT ,10000
9,XYZ ,ROBERTGURGAON,70000
I want to skip the lines if it contains keyword "ROBERT" and the expecting output as:
+------+--------------------+-------------+----------+
|emp_id| emp_name| emp_city|emp_salary|
+------+--------------------+-------------+----------+
| 1|VIKRANT SINGH RAN...| NOIDA | 10000|
| 3|GOVIND NIMBHAL ...| DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...| GURGAON | 50000|
| 4|ABHIJAN SINHA ...| SAKET | 65000|
| 5|SUPER DEVELOPER ...| USA | 50000|
| 6|RAJAT TYAGI ...| UP | 65000|
| 7|AJAY SHARMA ...| NOIDA | 70000|
| 8|SIDDHARTH BASU ...| SAKET | 72000|
+------+--------------------+-------------+----------+
I can load this file into dataframe and can filter using below expression for each column
newdf = emp_df.where(~ col("emp_city").like("ROBERT%"))
I am looking for some solution so that I can filter it before loading it into dataframe and need not to traversed all the columns to find the specific string.
python-3.x csv pyspark
python-3.x csv pyspark
asked Nov 13 '18 at 7:17
vikrant ranavikrant rana
6221215
6221215
1
Have you tried to read it tordd
andfilter
the result usinglambda
function before creating dataframe? Something like this:rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00
add a comment |
1
Have you tried to read it tordd
andfilter
the result usinglambda
function before creating dataframe? Something like this:rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00
1
1
Have you tried to read it to
rdd
and filter
the result using lambda
function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Have you tried to read it to
rdd
and filter
the result using lambda
function before creating dataframe? Something like this: rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00
add a comment |
1 Answer
1
active
oldest
votes
I was able to filter it using RDD.
textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)
[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']
newsplitRDD = newRDD.map(lambda l: l.split(","))
newDF = newsplitRDD.toDF()
>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53275750%2fskip-lines-from-csv-file-if-it-contains-specific-keyword-in-pyspark%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
I was able to filter it using RDD.
textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)
[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']
newsplitRDD = newRDD.map(lambda l: l.split(","))
newDF = newsplitRDD.toDF()
>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+
add a comment |
I was able to filter it using RDD.
textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)
[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']
newsplitRDD = newRDD.map(lambda l: l.split(","))
newDF = newsplitRDD.toDF()
>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+
add a comment |
I was able to filter it using RDD.
textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)
[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']
newsplitRDD = newRDD.map(lambda l: l.split(","))
newDF = newsplitRDD.toDF()
>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+
I was able to filter it using RDD.
textdata = sc.textFile(PATH_TO_FILE)
header=textdata.first();
textnewdata = textdata.filter(lambda x:x != header)
newRDD = textnewdata.filter(lambda row : 'ROBERT' not in row)
[u'1,VIKRANT SINGH RANA ,NOIDA ,10000',
u'3,GOVIND NIMBHAL ,DWARKA ,92000',
u'2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000',
u'4,ABHIJAN SINHA ,SAKET ,65000',
u'5,SUPER DEVELOPER ,USA ,50000',
u'6,RAJAT TYAGI ,UP ,65000',
u'7,AJAY SHARMA ,NOIDA ,70000',
u'8,SIDDHARTH BASU ,SAKET ,72000']
newsplitRDD = newRDD.map(lambda l: l.split(","))
newDF = newsplitRDD.toDF()
>>> newDF.show();
+---+--------------------+--------+-----+
| _1| _2| _3| _4|
+---+--------------------+--------+-----+
| 1|VIKRANT SINGH RAN...|NOIDA |10000|
| 3|GOVIND NIMBHAL ...|DWARKA |92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON |50000|
| 4|ABHIJAN SINHA ...|SAKET |65000|
| 5|SUPER DEVELOPER ...|USA |50000|
| 6|RAJAT TYAGI ...|UP |65000|
| 7|AJAY SHARMA ...|NOIDA |70000|
| 8|SIDDHARTH BASU ...|SAKET |72000|
+---+--------------------+--------+-----+
answered Nov 13 '18 at 8:51
vikrant ranavikrant rana
6221215
6221215
add a comment |
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53275750%2fskip-lines-from-csv-file-if-it-contains-specific-keyword-in-pyspark%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Have you tried to read it to
rdd
andfilter
the result usinglambda
function before creating dataframe? Something like this:rdd.filter(lambda row : 'ROBERT' not in row).toDF(rdd.first())
– Alla Tarighati
Nov 13 '18 at 7:31
Yeah it works. Thanks
– vikrant rana
Nov 13 '18 at 8:49
@Alla Tarighati- Thanks for your help
– vikrant rana
Nov 15 '18 at 13:02
Happy to hear that it helped :) and thank you for writing a proper answer for it.
– Alla Tarighati
Nov 15 '18 at 16:00