## Naive Bayes for text classification in Python

Naive Bayes classification uses Bayes' theorem to determine how probable it is that an item belongs to a category. In any body of text, some words tend to be correlated with other words. I chose two sub-disciplines that are distinct but have a significant amount of overlap: epistemology and ethics. Both employ the language of justification and reasons, and they intersect frequently. In the end, Naive Bayes performed surprisingly well in classifying these documents.

### What is Naive Bayes classification?

Bayes' theorem tells us that the probability of a hypothesis given some evidence is equal to the prior probability of the hypothesis multiplied by the probability of the evidence given the hypothesis, divided by the probability of the evidence:

P(H | E) = P(H) × P(E | H) / P(E)

Since classification tasks involve comparing two or more hypotheses, we can use the ratio form of Bayes' theorem, which compares the numerators of the formula above (for Bayes aficionados: the prior times the likelihood) for each hypothesis:

P(H₁ | E) / P(H₂ | E) = [P(H₁) × P(E | H₁)] / [P(H₂) × P(E | H₂)]

Since there are many words in a document, and Naive Bayes treats the words as independent given the category, the likelihood becomes a product over the words, and the formula becomes:

P(H₁ | E) / P(H₂ | E) = [P(H₁) × ∏ᵢ P(wᵢ | H₁)] / [P(H₂) × ∏ᵢ P(wᵢ | H₂)]

### A demonstration: classifying philosophy papers by their abstracts

The documents I will attempt to classify are article abstracts from a database called PhilPapers, a comprehensive database of research in philosophy. Since this database is curated by legions of topic editors, we can be reasonably confident that the document classifications given on the site are correct. I selected two philosophy sub-disciplines from the site for a binary Naive Bayes classifier, ethics and epistemology, and from each sub-discipline I selected a topic. The head and tail of my initial DataFrame looked like this:

To run a Naive Bayes classifier in scikit-learn, the categories must be numeric, so I assigned the label 1 to all ethics abstracts and the label 0 to all epistemology abstracts (that is, *not* ethics):

### Split the data into training and testing sets

### Convert the abstracts into word count vectors
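The ratio form of Bayes' theorem can be illustrated with a small numeric sketch. The priors and per-word likelihoods below are made-up numbers for illustration, not values estimated from the PhilPapers data:

```python
# Toy posterior-ratio calculation for two hypotheses (made-up numbers).
# H1 = "ethics", H2 = "epistemology"; the evidence is the words in a document.

prior = {"ethics": 0.5, "epistemology": 0.5}

# Hypothetical per-word likelihoods P(word | category):
likelihood = {
    "ethics":       {"duty": 0.010, "belief": 0.002},
    "epistemology": {"duty": 0.001, "belief": 0.012},
}

def unnormalized_posterior(category, words):
    """Prior times the product of per-word likelihoods (the Naive Bayes numerator)."""
    score = prior[category]
    for w in words:
        score *= likelihood[category][w]
    return score

doc = ["duty", "belief"]
ratio = unnormalized_posterior("ethics", doc) / unnormalized_posterior("epistemology", doc)
print(ratio)  # > 1 favors ethics, < 1 favors epistemology
```

A real classifier estimates these likelihoods from word counts in the training data; the structure of the calculation is the same.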
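The preparation steps described above can be sketched as follows. The DataFrame contents and column names (`abstract`, `category`) are my assumptions standing in for the real PhilPapers data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in data; the real abstracts come from PhilPapers.
df = pd.DataFrame({
    "abstract": [
        "An account of moral duty and obligation.",
        "On the justification of perceptual belief.",
        "Virtue ethics and the good life.",
        "Skepticism about knowledge of the external world.",
    ],
    "category": ["ethics", "epistemology", "ethics", "epistemology"],
})

# Numeric labels: 1 = ethics, 0 = epistemology (not ethics).
df["label"] = (df["category"] == "ethics").astype(int)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    df["abstract"], df["label"], test_size=0.25, random_state=42
)

# Convert abstracts into word-count vectors: one row per document,
# one column per word in the training vocabulary.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # learn vocabulary from training data only
X_test_counts = vectorizer.transform(X_test)        # reuse the same vocabulary
print(X_train_counts.shape)
```

Fitting the vectorizer on the training set alone, then reusing it on the test set, keeps the two matrices aligned on the same vocabulary.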
A Naive Bayes classifier needs to be able to calculate how many times each word appears in each document and how many times it appears in each category. To make this possible, the data needs to look something like this: each row represents a document, and each column represents a word. `CountVectorizer` creates a vector of word counts for each abstract, and together these vectors form a matrix in which each column corresponds to a word; every word appearing in the abstracts is represented. For details, see the scikit-learn documentation.

### Fit the model and make predictions

### Check the results

To understand these scores, it helps to see a breakdown:

- The accuracy score tells us: out of all of the identifications we made, how many were correct?
- The precision score tells us: out of all of the *ethics* identifications we made, how many were correct?
- The recall score tells us: out of all of the true cases of ethics, how many did we identify correctly?

To investigate the incorrect labels, we can put the actual labels and the predicted labels side by side in a DataFrame.

Overall, my Naive Bayes classifier performed well on the test set: there were only three mismatched labels.
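The fit-predict-score steps described above can be sketched end to end. The tiny inline corpus is my own stand-in for the real abstracts, so the scores it prints are not the results reported in this article:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Stand-in documents and labels (1 = ethics, 0 = epistemology).
docs = [
    "moral duty and virtue",
    "the virtue of moral character",
    "duty obligation and moral law",
    "virtue ethics and moral duty",
    "justified belief and knowledge",
    "knowledge from perceptual belief",
    "the justification of belief",
    "skeptical doubt about knowledge",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Convert text to word-count matrices.
vectorizer = CountVectorizer()
Xtr = vectorizer.fit_transform(X_train)
Xte = vectorizer.transform(X_test)

# MultinomialNB is the Naive Bayes variant suited to word-count features.
model = MultinomialNB()
model.fit(Xtr, y_train)
y_pred = model.predict(Xte)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
print("accuracy: ", acc)
print("precision:", prec)
print("recall:   ", rec)
```

`precision_score` and `recall_score` treat label 1 (ethics) as the positive class by default, matching the breakdown above.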
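The side-by-side comparison of actual and predicted labels can be sketched like this; the label values below are invented for illustration rather than taken from the actual test set:

```python
import pandas as pd

# Invented actual/predicted labels standing in for the test-set results.
y_test = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

results = pd.DataFrame({"actual": y_test, "predicted": y_pred})

# Keep only the rows where the classifier got the label wrong.
mismatches = results[results["actual"] != results["predicted"]]
print(mismatches)
```

The index of `mismatches` points back to the offending test documents, which makes it easy to pull up the corresponding abstracts and inspect them.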