The Random Forest algorithm forms part of a family of ensemble machine learning algorithms and is a popular variation of bagged decision trees. It also comes implemented in the OpenCV library.

In this tutorial, you will learn how to apply OpenCV's Random Forest algorithm for image classification, starting with the relatively easier banknote dataset and then testing the algorithm on OpenCV's digits dataset.

After completing this tutorial, you will know:

Several of the most important characteristics of the Random Forest algorithm.

How to use the Random Forest algorithm for image classification in OpenCV.

Kick-start your project with my book Machine Learning in OpenCV. It provides self-study tutorials with working code.

Let's get started.

## Tutorial Overview

This tutorial is divided into two parts; they are:

Reminder of How Random Forests Work

Applying the Random Forest Algorithm to Image Classification

Banknote Case Study

Digits Case Study

## Reminder of How Random Forests Work

The topic of the Random Forest algorithm has already been explained well in these tutorials by Jason Brownlee [1, 2], but let's first brush up on some of the most important points:

Random Forest is a type of ensemble machine learning algorithm called bagging. It is a popular variation of bagged decision trees.

A decision tree is a branched model that consists of a hierarchy of decision nodes, where each decision node splits the data based on a decision rule. Training a decision tree involves a greedy selection of the best split points (i.e., points that divide the input space best) by minimizing a cost function.
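To make the greedy split selection concrete, the following sketch exhaustively tries every (feature, threshold) pair on a toy dataset and keeps the split that minimizes the weighted Gini impurity, one common choice of cost function. This is an illustration of the idea only, not OpenCV's implementation:

```python
# Minimal sketch of a greedy split search using Gini impurity as the cost
# function. Each row is [feature_0, ..., feature_n, class_label].

def gini(groups, classes):
    """Weighted Gini impurity of a candidate split."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        labels = [row[-1] for row in g]
        p_sum = sum((labels.count(c) / len(g)) ** 2 for c in classes)
        score += (1.0 - p_sum) * (len(g) / n)
    return score

def best_split(dataset):
    """Greedily try every (feature, threshold) pair; keep the cheapest."""
    classes = sorted(set(row[-1] for row in dataset))
    best = (None, None, float('inf'))
    for feature in range(len(dataset[0]) - 1):
        for row in dataset:
            threshold = row[feature]
            left = [r for r in dataset if r[feature] < threshold]
            right = [r for r in dataset if r[feature] >= threshold]
            cost = gini([left, right], classes)
            if cost < best[2]:
                best = (feature, threshold, cost)
    return best

# Toy data: one feature cleanly separates the two classes at x >= 3
data = [[1.0, 0], [2.0, 0], [3.0, 1], [4.0, 1]]
print(best_split(data))  # (0, 3.0, 0.0)
```

A real implementation adds stopping criteria (maximum depth, minimum samples per node) and recurses on the two groups, but the core loop is exactly this search.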

The greedy manner in which decision trees construct their decision boundaries makes them susceptible to high variance. This means that small changes in the training dataset can lead to very different tree structures and, in turn, very different model predictions. If the decision tree is not pruned, it will also tend to capture noise and outliers in the training data. This sensitivity to the training data makes decision trees prone to overfitting.

Bagged decision trees address this susceptibility by combining the predictions from multiple decision trees, each trained on a bootstrap sample of the training dataset, created by sampling the dataset with replacement. The limitation of this approach stems from the fact that the same greedy approach trains each tree, and some samples may be picked several times during training, making it very possible that the trees share similar (or the same) split points (hence resulting in correlated trees).
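Bootstrap sampling is easy to visualize: each tree draws a sample of the same size as the dataset, with replacement, so some rows appear several times while others are left out entirely (the left-out rows are the "out-of-bag" samples that can later be used for error estimation). A minimal sketch:

```python
import random

random.seed(42)

dataset = list(range(10))  # ten training samples, identified by index

# Draw three bootstrap samples: same size as the dataset, sampled WITH
# replacement, so some indices repeat and others are left out ("out-of-bag").
for i in range(3):
    sample = [random.choice(dataset) for _ in range(len(dataset))]
    oob = sorted(set(dataset) - set(sample))
    print(f'tree {i}: sample={sorted(sample)}, out-of-bag={oob}')
```

On average, each bootstrap sample contains roughly 63% of the unique rows; the remaining ~37% form that tree's out-of-bag set.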

The Random Forest algorithm tries to mitigate this correlation by training each tree on a random subset of the training data, created by randomly sampling the dataset without replacement. In this manner, the greedy algorithm can only consider a fixed subset of the data to create the split points that make up each tree, which forces the trees to be different.

In the case of a classification problem, every tree in the forest produces a prediction output, and the final class label is identified as the output that the majority of the trees have produced. In the case of regression, the final output is the average of the outputs produced by all the trees.
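The two aggregation rules can be sketched in a few lines, using hypothetical per-tree outputs:

```python
from collections import Counter

# Hypothetical per-tree outputs for a single input sample
class_votes = [1, 0, 1, 1, 0]          # classification: one label per tree
regression_outputs = [2.5, 3.0, 2.0]   # regression: one value per tree

# Classification: the final label is the mode of the individual predictions
final_label = Counter(class_votes).most_common(1)[0][0]
print(final_label)  # 1

# Regression: the final output is the mean of the individual predictions
final_value = sum(regression_outputs) / len(regression_outputs)
print(final_value)  # 2.5
```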

## Applying the Random Forest Algorithm to Image Classification

### Banknote Case Study

We will first use the banknote dataset employed in this tutorial.

The banknote dataset is a relatively simple one that involves predicting a given banknote's authenticity. The dataset contains 1,372 rows, with each row representing a feature vector comprising four different measures extracted from a photograph of a banknote, plus its corresponding class label (authentic or not).

The values in each feature vector correspond to the following:

Variance of Wavelet Transformed image (continuous)

Skewness of Wavelet Transformed image (continuous)

Kurtosis of Wavelet Transformed image (continuous)

Entropy of image (continuous)

Class label (integer)

The dataset can be downloaded from the UCI Machine Learning Repository.

As in Jason's tutorial, we will load the dataset, convert its string values to floats, and partition it into training and testing sets:

```python
from csv import reader
from numpy import array, float32, int32, newaxis
from sklearn import model_selection as ms

# Function to load the dataset
def load_csv(filename):
    file = open(filename, 'rt')
    lines = reader(file)
    dataset = list(lines)
    return dataset

# Function to convert a string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float32(row[column].strip())

# Load the dataset from the text file
data = load_csv('Data/data_banknote_authentication.txt')

# Convert the dataset string values to float
for i in range(len(data[0])):
    str_column_to_float(data, i)

# Convert list to array
data = array(data)

# Separate the dataset samples from the ground truth
samples = data[:, :4]
target = data[:, -1, newaxis].astype(int32)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = ms.train_test_split(samples, target, test_size=0.2, random_state=10)
```


The OpenCV library implements the RTrees_create function in its ml module, which allows us to create an empty Random Forest model:

```python
# Create an empty Random Forest model
rtree = ml.RTrees_create()
```


All the trees in the forest will be trained with the same parameter values, albeit on different subsets of the training dataset. The default parameter values can be customized, but let's first work with the default implementation. We will return to customizing these parameter values in the next section:

```python
# Train the Random Forest
rtree.train(x_train, ml.ROW_SAMPLE, y_train)

# Predict the target labels of the testing data
_, y_pred = rtree.predict(x_test)

# Compute and print the achieved accuracy
accuracy = (sum(y_pred.astype(int32) == y_test) / y_test.size) * 100
print('Accuracy:', accuracy[0], '%')
```


```
Accuracy: 96.72727272727273 %
```

We have already obtained a high accuracy of around 96.73% using the default implementation of the Random Forest algorithm on the banknote dataset.

The complete code listing is as follows:

```python
from csv import reader
from numpy import array, float32, int32, newaxis
from cv2 import ml
from sklearn import model_selection as ms

# Function to load the dataset
def load_csv(filename):
    file = open(filename, 'rt')
    lines = reader(file)
    dataset = list(lines)
    return dataset

# Function to convert a string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float32(row[column].strip())

# Load the dataset from the text file
data = load_csv('Data/data_banknote_authentication.txt')

# Convert the dataset string values to float
for i in range(len(data[0])):
    str_column_to_float(data, i)

# Convert list to array
data = array(data)

# Separate the dataset samples from the ground truth
samples = data[:, :4]
target = data[:, -1, newaxis].astype(int32)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = ms.train_test_split(samples, target, test_size=0.2, random_state=10)

# Create an empty Random Forest model
rtree = ml.RTrees_create()

# Train the Random Forest
rtree.train(x_train, ml.ROW_SAMPLE, y_train)

# Predict the target labels of the testing data
_, y_pred = rtree.predict(x_test)

# Compute and print the achieved accuracy
accuracy = (sum(y_pred.astype(int32) == y_test) / y_test.size) * 100
print('Accuracy:', accuracy[0], '%')
```


### Digits Case Study

Consider applying the Random Forest to images from OpenCV's digits dataset.

The digits dataset is still relatively simple. However, the feature vectors we will extract from its images using the HOG method have higher dimensionality (81 features) than those in the banknote dataset. For this reason, we can consider the digits dataset to be relatively more challenging to work with than the banknote dataset.
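As a rough sanity check on where a number like 81 can come from, a HOG descriptor's length follows from the window, block, stride, and cell geometry. The parameter values below (a 20x20 window, 10x10 blocks and cells, a stride of 5 pixels, and 9 orientation bins) are an assumption chosen to reproduce 81; the tutorial's hog_descriptors helper may use different settings:

```python
# Hedged sketch: HOG descriptor length from its geometry. The parameter
# values here are assumptions chosen to match the 81 features mentioned
# in the text; the actual hog_descriptors helper may differ.
win_size, block_size, block_stride, cell_size, n_bins = 20, 10, 5, 10, 9

# Number of block positions along one axis of the window
blocks_per_axis = (win_size - block_size) // block_stride + 1   # 3

# Cells per block along one axis
cells_per_axis = block_size // cell_size                        # 1

descriptor_len = (blocks_per_axis ** 2) * (cells_per_axis ** 2) * n_bins
print(descriptor_len)  # 81
```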

We will first investigate how the default implementation of the Random Forest algorithm copes with higher-dimensional data:

```python
from digits_dataset import split_images, split_data
from feature_extraction import hog_descriptors
from numpy import array, float32
from cv2 import ml

# Load the digits image
img, sub_imgs = split_images('Images/digits.png', 20)

# Obtain training and testing datasets from the digits image
digits_train_imgs, digits_train_labels, digits_test_imgs, digits_test_labels = split_data(20, sub_imgs, 0.8)

# Convert the image data into HOG descriptors
digits_train_hog = hog_descriptors(digits_train_imgs)
digits_test_hog = hog_descriptors(digits_test_imgs)

# Create an empty Random Forest model
rtree_digits = ml.RTrees_create()

# Train the Random Forest with the default parameter values
rtree_digits.train(digits_train_hog.astype(float32), ml.ROW_SAMPLE, digits_train_labels)

# Predict the target labels of the testing data
_, digits_test_pred = rtree_digits.predict(digits_test_hog)

# Compute and print the achieved accuracy
accuracy_digits = (sum(digits_test_pred.astype(int) == digits_test_labels) / digits_test_labels.size) * 100
print('Accuracy:', accuracy_digits[0], '%')
```


We find that the default implementation returns an accuracy of 81%.

This drop in accuracy from that achieved on the banknote dataset may indicate that the capacity of the default implementation of the model is not enough to learn the complexity of the higher-dimensional data that we are now working with.

Let's investigate whether we can improve the accuracy by changing:

The termination criteria of the training algorithm, which consider the number of trees in the forest and the estimated performance of the model, as measured by the Out-Of-Bag (OOB) error. The current termination criteria may be retrieved using the getTermCriteria method and set using the setTermCriteria method. When using the latter, the number of trees may be set through the TERM_CRITERIA_MAX_ITER parameter, whereas the desired accuracy may be specified through the TERM_CRITERIA_EPS parameter.

The maximum possible depth that each tree in the forest can attain. The current depth may be retrieved using the getMaxDepth method and set using the setMaxDepth method. The specified tree depth may not be reached if the above termination criteria are met first.

When tweaking the above parameters, keep in mind that while increasing the number of trees can increase the model's capacity to capture more intricate detail in the training data, it will also increase the prediction time linearly and make the model more susceptible to overfitting. Hence, tweak the parameters judiciously.

If we add the following lines after creating the empty Random Forest model, we can find the default values of the tree depth as well as the termination criteria:

```python
print('Default tree depth:', rtree_digits.getMaxDepth())
print('Default termination criteria:', rtree_digits.getTermCriteria())
```


```
Default tree depth: 5
Default termination criteria: (3, 50, 0.1)
```

In this manner, we can see that, by default, each tree in the forest has a depth (or number of levels) equal to 5, while the number of trees and desired accuracy are set to 50 and 0.1, respectively. The first value returned by the getTermCriteria method refers to the type of termination criteria under consideration, where a value of 3 specifies termination based on both TERM_CRITERIA_MAX_ITER and TERM_CRITERIA_EPS.
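The type value of 3 is simply the sum of OpenCV's two criterion flags (cv2.TERM_CRITERIA_MAX_ITER is 1 and cv2.TERM_CRITERIA_EPS is 2). Mirroring those constants with plain integers, we can sketch how the tuple decodes without importing cv2:

```python
# These integer values mirror OpenCV's cv2.TERM_CRITERIA_MAX_ITER (1) and
# cv2.TERM_CRITERIA_EPS (2); their sum, 3, means both criteria are active.
TERM_CRITERIA_MAX_ITER = 1
TERM_CRITERIA_EPS = 2

def decode(criteria):
    kind, max_iter, eps = criteria
    parts = []
    if kind & TERM_CRITERIA_MAX_ITER:
        parts.append(f'stop after {max_iter} trees')
    if kind & TERM_CRITERIA_EPS:
        parts.append(f'or once the OOB error falls below {eps}')
    return ', '.join(parts)

print(decode((3, 50, 0.1)))
# stop after 50 trees, or once the OOB error falls below 0.1
```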

Let's now try changing the values mentioned above to investigate their effect on the prediction accuracy. The code listing is as follows:

```python
from digits_dataset import split_images, split_data
from feature_extraction import hog_descriptors
from numpy import array, float32
from cv2 import ml, TERM_CRITERIA_MAX_ITER, TERM_CRITERIA_EPS

# Load the digits image
img, sub_imgs = split_images('Images/digits.png', 20)

# Obtain training and testing datasets from the digits image
digits_train_imgs, digits_train_labels, digits_test_imgs, digits_test_labels = split_data(20, sub_imgs, 0.8)

# Convert the image data into HOG descriptors
digits_train_hog = hog_descriptors(digits_train_imgs)
digits_test_hog = hog_descriptors(digits_test_imgs)

# Create an empty Random Forest model
rtree_digits = ml.RTrees_create()

# Read the default parameter values
print('Default tree depth:', rtree_digits.getMaxDepth())
print('Default termination criteria:', rtree_digits.getTermCriteria())

# Change the default parameter values
rtree_digits.setMaxDepth(15)
rtree_digits.setTermCriteria((TERM_CRITERIA_MAX_ITER + TERM_CRITERIA_EPS, 100, 0.01))

# Train the Random Forest
rtree_digits.train(digits_train_hog.astype(float32), ml.ROW_SAMPLE, digits_train_labels)

# Predict the target labels of the testing data
_, digits_test_pred = rtree_digits.predict(digits_test_hog)

# Compute and print the achieved accuracy
accuracy_digits = (sum(digits_test_pred.astype(int) == digits_test_labels) / digits_test_labels.size) * 100
print('Accuracy:', accuracy_digits[0], '%')
```


We can see that the newly set parameter values bump the prediction accuracy up to 94.1%.

These parameter values were set arbitrarily here for the purpose of illustration. It is always advisable, however, to take a more systematic approach to tuning a model's parameters and investigating how each affects its performance.

## Further Reading

This section provides more resources on the topic if you want to go deeper.

### Books

### Websites

## Summary

In this tutorial, you learned how to apply OpenCV's Random Forest algorithm for image classification, starting with the relatively easier banknote dataset and then testing the algorithm on OpenCV's digits dataset.

Specifically, you learned:

Several of the most important characteristics of the Random Forest algorithm.

How to use the Random Forest algorithm for image classification in OpenCV.

Do you have any questions?

Ask your questions in the comments below, and I will do my best to answer.

## Get Started on Machine Learning in OpenCV!

Learn how to use machine learning techniques in image processing projects

...using OpenCV in advanced ways and working beyond pixels

Discover how in my new Ebook:

Machine Learning in OpenCV

It provides self-study tutorials with all working code in Python to take you from a novice to expert. It equips you with

logistic regression, random forest, SVM, k-means clustering, neural networks,

and much more... all using the machine learning module in OpenCV

Kick-start your deep learning journey with hands-on exercises

See What's Inside