Why does caret Naive Bayes fail with "(converted from warning) predictions failed for Resample01"?
I am using the USCongress dataset from the R package RTextTools to experiment with text classification using Naive Bayes. From the USCongress data I create a new tibble (named congress_df) that retains only the 'text' and 'major' fields, relabelled as "description" and "class". A snapshot of congress_df:
description                                                        class
To suspend temporarily the duty on Fast Magenta 2 Stage.              18
To suspend temporarily the duty on Fast Black 286 Stage.              18
To repeal subtitle B of title III of the Gramm-Leach-Bliley Act.      15
To extend the Irish Peace Process Cultural and Training Program.      19
I have written the core code below, which appears to work reasonably well with the e1071 Naive Bayes function (accuracy of 75.28%):
library(tidyverse)
library(tidytext)
library(caret)
library(tm)
library(RTextTools)
library(e1071)
data(USCongress, package = "RTextTools")
# Convert to a tibble, keeping only the 'text' and 'major' fields,
# and change 'text' from factor to character
congress_df <- tibble::as_tibble(USCongress[,c(6, 5)]) %>%
mutate(text = as.character(text))
#Rename columns
colnames(congress_df) <- c("description", "class")
#Create corpus
congress_corpus <- tm::VCorpus(VectorSource(congress_df$description))
#Create document term matrix
congress_dtm <- tm::DocumentTermMatrix(congress_corpus,
                                       control = list(tolower = TRUE,
                                                      removeNumbers = TRUE,
                                                      stopwords = TRUE,
                                                      removePunctuation = TRUE,
                                                      stemming = TRUE))
#Remove infrequent terms
min_term_pthreshold <- 0.001 # as a proportion of documents
min_freq <- round(congress_dtm$nrow * min_term_pthreshold, 0)
#Create vector of most frequent words
freq_words <- congress_dtm %>%
tm::findFreqTerms(lowfreq = min_freq)
#Only keep frequent terms
congress_dtm <- congress_dtm[, freq_words]
#Training & Test set
train_split <- 0.8
num_records <- nrow(congress_dtm)
train_records <- floor(num_records * train_split)
test_records <- train_records + 1
congress_dtm_train <- congress_dtm[1 : train_records, ]
congress_dtm_test <- congress_dtm[test_records : num_records, ]
#Training & Test Label
congress_train_labels <- as.factor(congress_df[1:train_records, ]$class)
congress_test_labels <- as.factor(congress_df[test_records:num_records, ]$class)
# Recode word counts as a Yes/No presence indicator
convert_values <- function(x) {
  x <- ifelse(x > 0, "Yes", "No")
  x <- as.factor(x)
  x
}
congress_dtm_train <- apply(congress_dtm_train, MARGIN = 2, convert_values)
congress_dtm_test <- apply(congress_dtm_test, MARGIN = 2, convert_values)
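As an aside on the split itself: the code above takes a simple sequential 80/20 split, which assumes the classes are spread evenly through the file. A stratified random split is something I have considered but not yet wired into the rest of the pipeline; a rough sketch (untested, same variable names as above) would be:
# Sketch only (untested here): stratified 80/20 split via caret, so each
# class is represented proportionally in both the training and test sets
set.seed(123)
train_idx <- caret::createDataPartition(as.factor(congress_df$class),
                                        p = 0.8, list = FALSE)[, 1]
congress_dtm_train <- congress_dtm[train_idx, ]
congress_dtm_test <- congress_dtm[-train_idx, ]
congress_train_labels <- as.factor(congress_df$class[train_idx])
congress_test_labels <- as.factor(congress_df$class[-train_idx])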
If I now apply the naiveBayes function from the e1071 package I obtain a reasonable result:
congress_classifier <- e1071::naiveBayes(congress_dtm_train,
congress_train_labels)
#Make predictions on test set
congress_test_pred <- predict(congress_classifier, congress_dtm_test)
#Create confusion matrix
confusion_mat <- caret::confusionMatrix(data = congress_test_pred,
                                        reference = congress_test_labels,
                                        dnn = c("Prediction", "Actual"))
confusion_mat
Overall Statistics
Accuracy : 0.7528
95% CI : (0.7231, 0.7808)
No Information Rate : 0.1607
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7313
Mcnemar's Test P-Value : NA
However, if I try to apply the Naive Bayes model from the caret package, it doesn't work. I have tried all sorts of variations on the following:
model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb')
or
model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb',
                      trControl = trainControl(method = 'cv', number = 10))
In all my attempts I cannot get the caret approach to work. Depending on the adjustments I make to the lines above, I obtain different errors, including:
task 1 failed - "(converted from warning) predictions failed for Resample01: usekernel= TRUE, fL=0, adjust=1 Error in data.matrix(newdata) : (converted from warning) NAs introduced by coercion"

task 1 failed - "(converted from warning) predictions failed for Fold01: usekernel= TRUE, fL=0, adjust=1 Error in data.matrix(newdata) : (converted from warning) NAs introduced by coercion"

task 1 failed - "(converted from warning) model fit failed for Fold01: usekernel=FALSE, fL=1, adjust=FALSE Error in `[.default`(x, , 2) : subscript out of bounds"
Whilst I could just stick with the e1071 approach, I want to understand what is wrong with the caret approach, since surely the two should give similar results?
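My best guess, reading the "NAs introduced by coercion" message, is that caret's 'nb' method (which wraps klaR::NaiveBayes) ends up calling data.matrix() on the predictors during resampling, and a character matrix of "Yes"/"No" values becomes all NAs under that coercion. One untested variation I intend to try is handing train() a data frame of factors instead of a character matrix:
# Untested sketch: keep the Yes/No predictors as factors in a data.frame so
# that nothing downstream coerces them to numeric
train_df <- as.data.frame(congress_dtm_train, stringsAsFactors = TRUE)
model <- caret::train(x = train_df,
                      y = congress_train_labels,
                      method = 'nb',
                      trControl = trainControl(method = 'cv', number = 10))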
Also, more generally, can anybody suggest improvements or modifications to my code? Always happy to learn!
Lastly, what ML approaches would people suggest to try to improve the accuracy? Is it possible to get above, say, 85%?
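For context, one approach already on my list to try next is a linear SVM on the raw term counts (skipping the Yes/No conversion). This is a sketch I have not yet run, reusing the variable names defined above:
# Hypothetical sketch: linear SVM on the numeric counts, reusing the
# sequential split indices from above
svm_model <- e1071::svm(x = as.matrix(congress_dtm[1:train_records, ]),
                        y = congress_train_labels,
                        kernel = 'linear')
svm_pred <- predict(svm_model,
                    as.matrix(congress_dtm[test_records:num_records, ]))
caret::confusionMatrix(svm_pred, congress_test_labels)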
Tags: r, r-caret, text-classification, naivebayes
asked Nov 13 '18 at 10:47 by user2948208