The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms.
The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.
That is the definition in brief. Before going deeper, let's first understand what machine learning is.
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The primary aim is to allow the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
ML has various algorithms (learning methods), categorized as:
- Supervised machine learning algorithms
- Unsupervised machine learning algorithms
- Semi-supervised machine learning algorithms
Machine learning enables the analysis of massive quantities of data. While it generally delivers faster, more accurate results in order to identify profitable opportunities or dangerous risks, it may also require additional time and resources to train it properly. Combining machine learning with AI and cognitive technologies can make it even more effective in processing large volumes of information.
Now, coming back to the main topic, bag of words:
*Since it is a machine learning technique, it was important to introduce the basics of ML first.
*ML algorithms can sometimes be difficult to write and understand, and raw text introduces problems of its own.
The Problem with Text
The problem with modeling text is that it is messy, and techniques like machine learning algorithms prefer well-defined, fixed-length inputs and outputs.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers. Specifically, vectors of numbers.
In language processing, the vectors x are derived from textual data, in order to reflect various linguistic properties of the text.
This is called feature extraction or feature encoding: a process of dimensionality reduction in which an initial set of raw data is reduced to more manageable groups for processing. Reducing the data, and the effort the machine spends building variable combinations (features), speeds up the learning and generalization steps of the machine learning process.
A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
*A bag-of-words model, or BOW, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.
*The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
*A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of known words.
- A measure of the presence of known words.
*It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
*The bag-of-words can be as simple or complex as you like. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
**Let's consider an example to understand this model.
Step 1: Collect Data
It was the best of times,
It was the worst of times,
It was the age of wisdom,
It was the age of foolishness,
For this small example, let’s treat each line above as a separate “document” and the 4 lines as our entire corpus of documents.
Step 2: Design the Vocabulary
Now we can make a list of all the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are:
- “it”
- “was”
- “the”
- “best”
- “of”
- “times”
- “worst”
- “age”
- “wisdom”
- “foolishness”
That is a vocabulary of 10 words from a corpus containing 24 words.
Note that repeated words appear only once in the vocabulary.
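The vocabulary-building step above can be sketched in plain Python. This is a minimal illustration using only the standard library; the `tokenize` helper is an assumption about how case and punctuation are stripped, not part of any particular library.

```python
import string

# The four "documents" that make up our tiny corpus.
corpus = [
    "It was the best of times,",
    "It was the worst of times,",
    "It was the age of wisdom,",
    "It was the age of foolishness,",
]

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

# Collect unique words, preserving the order in which they first appear.
vocabulary = list(dict.fromkeys(w for doc in corpus for w in tokenize(doc)))

print(vocabulary)
# ['it', 'was', 'the', 'best', 'of', 'times', 'worst', 'age', 'wisdom', 'foolishness']
print(len(vocabulary))  # 10
```

Using `dict.fromkeys` rather than a `set` keeps the vocabulary in first-seen order, which makes the document vectors below easy to read.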
Step 3: Create Document Vectors
The next step is to score the words in each document.
The objective is to turn each document of free text into a vector that we can
use as input or output for a machine learning model.
Because we know the vocabulary has 10 words, we can use a fixed-length
document representation of 10, with one position in the vector to score each word.
The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.
Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“It was the best of times”) and convert it into a binary vector.
The scoring of the document would look as follows:
- “it” = 1
- “was” = 1
- “the” = 1
- “best” = 1
- “of” = 1
- “times” = 1
- “worst” = 0
- “age” = 0
- “wisdom” = 0
- “foolishness” = 0
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
The other three documents would look as follows:
"it this was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it this was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it this was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
All ordering of the words is discarded, and we have a consistent way of extracting features from any document in our corpus, ready for use in modeling.
New documents that overlap with the vocabulary of known words, but may contain words outside of the vocabulary, can still be encoded: only the occurrence of known words is scored, and unknown words are ignored.
You can see how this might naturally scale to large vocabularies and larger documents.
Here, the length of the document vector is equal to the number of known words.
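The binary scoring above can be sketched as follows. This is a minimal, self-contained illustration under the same assumptions as before (the corpus and `tokenize` helper are repeated so the snippet runs on its own); the example sentence with an out-of-vocabulary word is made up for demonstration.

```python
import string

corpus = [
    "It was the best of times,",
    "It was the worst of times,",
    "It was the age of wisdom,",
    "It was the age of foolishness,",
]

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

vocabulary = list(dict.fromkeys(w for doc in corpus for w in tokenize(doc)))

def binary_vector(doc, vocab):
    """Score each vocabulary word 1 if present in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if word in present else 0 for word in vocab]

vectors = [binary_vector(doc, vocabulary) for doc in corpus]
print(vectors[0])  # [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# A new document with an unknown word ("era", a hypothetical example):
# known words are scored and the unknown word is simply ignored.
print(binary_vector("it was the best of era", vocabulary))
# [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

Note how every document, regardless of its length, maps to a vector of exactly `len(vocabulary)` positions, matching the fixed-length representation the text describes.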
*Understanding how we assigned binary values to the words in the above example:
Once a vocabulary has been chosen, the occurrence of words in example documents needs to be scored.
In the worked example, we have already seen one very simple approach to scoring: a binary scoring of the presence or absence of words.
Some additional simple scoring methods include:
- Counts. Count the number of times each word appears in a document
- Frequencies. Calculate the frequency that each word appears in a document out of all the words in the document.
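Both alternative scoring methods can be sketched in the same style as the binary example. This is a minimal illustration using only the standard library; the test sentence with repeated words is made up to show counts greater than one.

```python
import string
from collections import Counter

corpus = [
    "It was the best of times,",
    "It was the worst of times,",
    "It was the age of wisdom,",
    "It was the age of foolishness,",
]

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return text.split()

vocabulary = list(dict.fromkeys(w for doc in corpus for w in tokenize(doc)))

def count_vector(doc, vocab):
    """Counts: how many times each vocabulary word appears in the document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]

def frequency_vector(doc, vocab):
    """Frequencies: each word's count divided by the document's total words."""
    words = tokenize(doc)
    counts = Counter(words)
    return [counts[word] / len(words) for word in vocab]

# A hypothetical document with repeated words to make the counts visible.
doc = "it was the best of times, it was"
print(count_vector(doc, vocabulary))
# [2, 2, 1, 1, 1, 1, 0, 0, 0, 0]
print(frequency_vector(doc, vocabulary)[0])  # 0.25 (2 of 8 words are "it")
```

Counts and frequencies carry more information than the binary scheme: they distinguish a word that appears once from one that dominates the document.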
Every model has limitations, and bag-of-words is of course no exception.
Limitations of Bag-of-Words
The bag-of-words model is very simple to understand and implement and offers a lot of flexibility for customization on your specific text data.
It has been used with great success on prediction problems like language modeling and document classification.
Nevertheless, it suffers from some shortcomings, such as:
- Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
- Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.
- Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model; if captured, they could tell the difference between the same words arranged differently (“this is interesting” vs. “is this interesting”), synonyms (“old bike” vs. “used bike”), and much more.