Why does Caret Naive Bayes return (converted from warning) predictions failed for Resample01?

I am using the USCongress dataset from the R package RTextTools to experiment with text classification using Naive Bayes. From the dataset's fields I create a new tibble data frame (named congress_df) that retains only the 'text' and 'major' fields, which I rename to "description" and "class". A snapshot of congress_df therefore looks like this:



description                                                        class
To suspend temporarily the duty on Fast Magenta 2 Stage.           18
To suspend temporarily the duty on Fast Black 286 Stage.           18
To repeal subtitle B of title III of the Gramm-Leach-Bliley Act.   15
To extend the Irish Peace Process Cultural and Training Program.   19


I have written the core code below, which appears to work reasonably well when I use the e1071 Naive Bayes function (accuracy of 75.28%):



library(tidyverse)
library(tidytext)
library(caret)
library(tm)
library(RTextTools)
library(e1071)

data(USCongress, package = "RTextTools")

# Convert the data frame to a tibble, change the 'text' field from
# factor to character, and retain only the 'text' and 'major' fields
congress_df <- tibble::as_tibble(USCongress[, c(6, 5)]) %>%
  mutate(text = as.character(text))

# Rename columns
colnames(congress_df) <- c("description", "class")

# Create corpus
congress_corpus <- tm::VCorpus(VectorSource(congress_df$description))

# Create document-term matrix
congress_dtm <- tm::DocumentTermMatrix(congress_corpus,
                                       control = list(tolower = TRUE,
                                                      removeNumbers = TRUE,
                                                      stopwords = TRUE,
                                                      removePunctuation = TRUE,
                                                      stemming = TRUE))

# Remove infrequent terms
min_term_pthreshold <- 0.001 # as a proportion of documents
min_freq <- round(congress_dtm$nrow * min_term_pthreshold, 0)

# Create vector of the most frequent words
freq_words <- congress_dtm %>%
  tm::findFreqTerms(lowfreq = min_freq)

# Only keep frequent terms
congress_dtm <- congress_dtm[, freq_words]

# Training & test sets
train_split <- 0.8
num_records <- nrow(congress_dtm)
train_records <- floor(num_records * train_split)
test_records <- train_records + 1

congress_dtm_train <- congress_dtm[1:train_records, ]
congress_dtm_test <- congress_dtm[test_records:num_records, ]

# Training & test labels
congress_train_labels <- as.factor(congress_df[1:train_records, ]$class)
congress_test_labels <- as.factor(congress_df[test_records:num_records, ]$class)

# Recode term counts as "Yes"/"No" presence indicators; note that apply()
# simplifies the result to a character matrix, so converting to factor
# inside this function would have no lasting effect
convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

congress_dtm_train <- apply(congress_dtm_train, MARGIN = 2, convert_values)
congress_dtm_test <- apply(congress_dtm_test, MARGIN = 2, convert_values)


If I now apply the naiveBayes function from the e1071 package I obtain a reasonable result:



congress_classifier <- e1071::naiveBayes(congress_dtm_train,
                                         congress_train_labels)

# Make predictions on the test set
congress_test_pred <- predict(congress_classifier, congress_dtm_test)

# Create confusion matrix
confusion_mat <- caret::confusionMatrix(data = congress_test_pred,
                                        reference = congress_test_labels,
                                        dnn = c("Prediction", "Actual"))

confusion_mat


Overall Statistics

               Accuracy : 0.7528
                 95% CI : (0.7231, 0.7808)
    No Information Rate : 0.1607
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7313
 Mcnemar's Test P-Value : NA


However, when I try to apply the naive Bayes model from the caret package it doesn't work. I have tried all sorts of variations on the following:



model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb')


or



model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb',
                      trControl = trainControl(method = 'cv', number = 10))
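
As a further variation, I sketched out handing caret a data frame of factor columns instead of a character matrix, on the guess that this is closer to what the underlying Naive Bayes implementation expects. The stringsAsFactors conversion here is my own assumption, not something taken from a caret example:

# Sketch (my guess): pass a data frame whose columns are factors rather
# than a character matrix, so the predictors are treated as categorical
train_df <- as.data.frame(congress_dtm_train, stringsAsFactors = TRUE)
model <- caret::train(train_df, congress_train_labels, 'nb',
                      trControl = trainControl(method = 'cv', number = 10))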


In all my attempts I can't get the caret approach to work. Depending on the adjustments I make to the lines above, I obtain different errors, including:



  • task 1 failed - "(converted from warning) predictions failed for Resample01: usekernel=TRUE, fL=0, adjust=1 Error in data.matrix(newdata) :
    (converted from warning) NAs introduced by coercion

  • task 1 failed - "(converted from warning) predictions failed for Fold01: usekernel=TRUE, fL=0, adjust=1 Error in data.matrix(newdata) :
    (converted from warning) NAs introduced by coercion

  • task 1 failed - "(converted from warning) model fit failed for Fold01: usekernel=FALSE, fL=1, adjust=FALSE Error in `[.default`(x, , 2) : subscript out of bounds
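
My reading of the first two errors (and I may be wrong) is that caret's 'nb' method wraps klaR::NaiveBayes rather than e1071::naiveBayes, and that during prediction klaR calls data.matrix() on my matrix of "Yes"/"No" strings, since the traceback points at data.matrix(newdata). Coercing such strings to numeric reproduces exactly that warning:

# Minimal reproduction of the warning: "Yes"/"No" strings cannot be
# parsed as numbers, so numeric coercion yields NA with this message
as.numeric(c("Yes", "No"))
#> Warning message: NAs introduced by coercion
#> [1] NA NA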


Whilst I could just stick with the e1071 approach, I want to understand what is wrong with the caret approach, since surely both should give the same results?



Also, more generally, can anybody suggest any improvements or modifications to my code? I'm always happy to learn!
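
For example, one thing I wondered about myself is whether the colnames()<- renaming could be written more idiomatically with dplyr (a small sketch; the rename() call assumes the columns are still named text and major at that point):

# Pipe-friendly equivalent of the colnames()<- renaming (dplyr)
congress_df <- congress_df %>%
  rename(description = text, class = major)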



Lastly, what ML approaches would people suggest to try to improve the accuracy? Is it possible to get above, say, 85%?
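
For context, the kind of alternative I have in mind looks like the sketch below; it assumes the kernlab package is installed and reuses the numeric term counts from before the Yes/No recoding, and 'svmLinear' is just one illustrative choice, not something I have benchmarked on this data:

# Sketch: a linear SVM on the raw term counts (taken from the DTM before
# the Yes/No recoding), often a stronger baseline than Naive Bayes on text
train_counts <- as.matrix(congress_dtm[1:train_records, ])
model_svm <- caret::train(train_counts, congress_train_labels, 'svmLinear',
                          trControl = trainControl(method = 'cv', number = 5))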

r r-caret text-classification naivebayes

asked Nov 13 '18 at 10:47
user2948208