Why does Caret Naive Bayes return (converted from warning) predictions failed for Resample01?

I am using the USCongress dataset from the R package RTextTools to experiment with text classification using Naive Bayes. From the dataset's fields I create a new tibble data frame (named congress_df) that retains only the 'text' and 'major' fields, which I rename to "description" and "class". A snapshot of congress_df therefore looks like this:



description                                                        class
To suspend temporarily the duty on Fast Magenta 2 Stage.           18
To suspend temporarily the duty on Fast Black 286 Stage.           18
To repeal subtitle B of title III of the Gramm-Leach-Bliley Act.   15
To extend the Irish Peace Process Cultural and Training Program.   19


I have written the core code below, which appears to work reasonably well when I use the e1071 Naive Bayes function (accuracy of 75.28%):



library(tidyverse)
library(tidytext)
library(caret)
library(tm)
library(RTextTools)
library(e1071)

data(USCongress, package = "RTextTools")

# Convert the data frame to a tibble, change the 'text' field from
# factor to character, and retain only the 'text' and 'major' fields
congress_df <- tibble::as_tibble(USCongress[, c(6, 5)]) %>%
  mutate(text = as.character(text))

# Rename columns
colnames(congress_df) <- c("description", "class")

# Create corpus
congress_corpus <- tm::VCorpus(VectorSource(congress_df$description))

# Create document-term matrix
congress_dtm <- tm::DocumentTermMatrix(congress_corpus,
                                       control = list(tolower = TRUE,
                                                      removeNumbers = TRUE,
                                                      stopwords = TRUE,
                                                      removePunctuation = TRUE,
                                                      stemming = TRUE))

# Remove infrequent terms
min_term_pthreshold <- 0.001 # as a proportion of documents
min_freq <- round(congress_dtm$nrow * min_term_pthreshold, 0)

# Create vector of the most frequent words
freq_words <- congress_dtm %>%
  tm::findFreqTerms(lowfreq = min_freq)

# Only keep frequent terms
congress_dtm <- congress_dtm[, freq_words]

# Training & test sets
train_split <- 0.8
num_records <- nrow(congress_dtm)
train_records <- floor(num_records * train_split)
test_records <- train_records + 1

congress_dtm_train <- congress_dtm[1:train_records, ]
congress_dtm_test <- congress_dtm[test_records:num_records, ]

# Training & test labels
congress_train_labels <- as.factor(congress_df[1:train_records, ]$class)
congress_test_labels <- as.factor(congress_df[test_records:num_records, ]$class)

# Recode term counts as "Yes"/"No" presence indicators; note that apply()
# simplifies the result to a character matrix, so converting to factor
# inside this function would have no lasting effect
convert_values <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

congress_dtm_train <- apply(congress_dtm_train, MARGIN = 2, convert_values)
congress_dtm_test <- apply(congress_dtm_test, MARGIN = 2, convert_values)


If I now apply the naiveBayes function from the e1071 package I obtain a reasonable result:



congress_classifier <- e1071::naiveBayes(congress_dtm_train,
                                         congress_train_labels)

# Make predictions on the test set
congress_test_pred <- predict(congress_classifier, congress_dtm_test)

# Create confusion matrix
confusion_mat <- caret::confusionMatrix(data = congress_test_pred,
                                        reference = congress_test_labels,
                                        dnn = c("Prediction", "Actual"))

confusion_mat


Overall Statistics

               Accuracy : 0.7528
                 95% CI : (0.7231, 0.7808)
    No Information Rate : 0.1607
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7313
 Mcnemar's Test P-Value : NA


However, when I try to apply the naive Bayes model from the caret package it doesn't work. I have tried all sorts of variations on the following:



model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb')


or



model <- caret::train(as.matrix(congress_dtm_train),
                      congress_train_labels, 'nb',
                      trControl = trainControl(method = 'cv', number = 10))
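
As a further variation, I sketched out handing caret a data frame of factor columns instead of a character matrix, on the guess that this is closer to what the underlying Naive Bayes implementation expects. The stringsAsFactors conversion here is my own assumption, not something taken from a caret example:

# Sketch (my guess): pass a data frame whose columns are factors rather
# than a character matrix, so the predictors are treated as categorical
train_df <- as.data.frame(congress_dtm_train, stringsAsFactors = TRUE)
model <- caret::train(train_df, congress_train_labels, 'nb',
                      trControl = trainControl(method = 'cv', number = 10))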


In all my attempts I can't get the caret approach to work. Depending on the adjustments I make to the lines above, I obtain different errors, including:



  • task 1 failed - "(converted from warning) predictions failed for Resample01: usekernel=TRUE, fL=0, adjust=1 Error in data.matrix(newdata) :
    (converted from warning) NAs introduced by coercion

  • task 1 failed - "(converted from warning) predictions failed for Fold01: usekernel=TRUE, fL=0, adjust=1 Error in data.matrix(newdata) :
    (converted from warning) NAs introduced by coercion

  • task 1 failed - "(converted from warning) model fit failed for Fold01: usekernel=FALSE, fL=1, adjust=FALSE Error in `[.default`(x, , 2) : subscript out of bounds
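
My reading of the first two errors (and I may be wrong) is that caret's 'nb' method wraps klaR::NaiveBayes rather than e1071::naiveBayes, and that during prediction klaR calls data.matrix() on my matrix of "Yes"/"No" strings, since the traceback points at data.matrix(newdata). Coercing such strings to numeric reproduces exactly that warning:

# Minimal reproduction of the warning: "Yes"/"No" strings cannot be
# parsed as numbers, so numeric coercion yields NA with this message
as.numeric(c("Yes", "No"))
#> Warning message: NAs introduced by coercion
#> [1] NA NA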


Whilst I could just stick with the e1071 approach, I want to understand what is wrong with the caret approach, since surely both should give the same results?



Also, more generally, can anybody suggest any improvements or modifications to my code? I'm always happy to learn!
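
For example, one thing I wondered about myself is whether the colnames()<- renaming could be written more idiomatically with dplyr (a small sketch; the rename() call assumes the columns are still named text and major at that point):

# Pipe-friendly equivalent of the colnames()<- renaming (dplyr)
congress_df <- congress_df %>%
  rename(description = text, class = major)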



Lastly, what ML approaches would people suggest to try to improve the accuracy? Is it possible to get above, say, 85%?
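
For context, the kind of alternative I have in mind looks like the sketch below; it assumes the kernlab package is installed and reuses the numeric term counts from before the Yes/No recoding, and 'svmLinear' is just one illustrative choice, not something I have benchmarked on this data:

# Sketch: a linear SVM on the raw term counts (taken from the DTM before
# the Yes/No recoding), often a stronger baseline than Naive Bayes on text
train_counts <- as.matrix(congress_dtm[1:train_records, ])
model_svm <- caret::train(train_counts, congress_train_labels, 'svmLinear',
                          trControl = trainControl(method = 'cv', number = 5))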

r r-caret text-classification naivebayes

asked Nov 13 '18 at 10:47
user2948208