1. Introduction
Animal husbandry is one of the most lucrative and demanding businesses worldwide and contributes significantly to national gross domestic product (GDP). As per a report published by the World Bank (2022), agriculture (and its allied sectors) accounts for almost 4.01% of the world’s GDP, a share that rises to as much as 25% in developing countries [1].
Figure 1 represents India’s GDP distribution, showing that agriculture contributes nearly 19% of the GDP [2]. In particular, dairy farming is a major contributor (about 5.30%), with milk as the principal livestock product [3]. As per the 2020 report of the Indian National Accounts Statistics (NAS), the livestock sector contributes 4.19% of the total gross value added (GVA) and 28.63% of the total agriculture and allied sector GVA [2]. These businesses are largely run by small, marginal farmers and landless workers. Dairy farming is a secondary source of income for thousands of rural families and provides a livelihood to two-thirds of the rural community. The sector is undergoing rapid growth due to urbanization, population growth, and, most importantly, rising incomes in developing countries. Indian policymakers have also proposed self-sustainable models, supported by greater use of technology and market linkages, to double farmers’ income by 2022 [4].
1.1. Motivation
Proficient animal husbandry would allow stockmen (and eventually the associated companies or national agencies) to earn more profit. The profit from domestic cattle (particularly cows) is strongly influenced by their commercial value and the cost of raising them. The mercantile value of cows mainly depends on their fertility rate, milk production, and the chemical composition of their milk. Moreover, different breeds yield different milk varieties, since breed affects milk composition, fatty acid composition, and coagulation properties [5,6]. Owing to breed-specific lactation patterns, milk production is relatively consistent within a breed but differs considerably from one breed to another. For instance, Gyr cows produce 900–1600 kg of milk per lactation, whereas Holstein Friesians produce 7200–9000 kg [7]. This variation in milk yield between individual cows can turn into significant losses for businesses that encompass thousands of cows. Therefore, identifying a cow’s breed would benefit the dairy industry.
With populations increasing and farms shrinking, dairy livestock needs better monitoring for breed associations, making breed identification of individual cattle key to modern dairy farming. Breed identification of individual cows may offer information to stockmen and assist in making important decisions about an animal, such as the opportunity for cross-breeding to enhance the production rate. Additionally, recognizing breeds plays a vital role in automatic behavior analysis, health monitoring, and the detection of lameness, and helps estimate fertility rates. Further, the individual identification and tracking of cow breeds may offer various farming opportunities for disease detection (e.g., early detection of disease outbreaks and transmission), disease prevention and treatment, fertility and feeding, and welfare monitoring. Identification of cow breeds would balance the trade-off between cost and management, thereby improving the productivity and profitability of dairy farms.
1.2. Related Work
Generally, breed identification methods have been classified into tag-based and visual-feature-based approaches. The tag-based methods use permanent markings (such as tattooing and ear notching), temporary markings (such as ear tagging), and electronic devices (such as radio frequency identification, RFID) [8]. Permanent markings can only identify individuals in smaller groups, whereas temporary markings have been found susceptible to fraudulent manipulation and easy duplication [9]. Furthermore, these approaches require specific sensing devices on the body, which may require invasive techniques. These bottlenecks led to the development of RFID-based electronic identification devices. However, implementing RFID chips and scanners at various checkpoints is challenging and requires skilled personnel, particularly while monitoring groups of animals. Moreover, these methods are prone to duplication or false identification when monitoring numerous livestock animals in harsh outdoor environments. In addition, these devices are expensive and easily damaged.
The reported literature reveals various visual-biometric-based cattle identification methodologies that utilize the unique external biometric characteristics of breeds (such as coat pattern, muzzle pattern, and body contour) to effectively address the limitations of tag-based techniques [10]. However, the exact identification and extraction of these features are challenging, even for an expert, which limits the wide acceptance of earlier approaches. In contrast, deep learning (DL) techniques have established their superiority in complex object detection and recognition tasks [11,12]. Motivated by this, DL approaches have been successfully employed to extract hidden features to classify and localize species such as sheep, dogs, and birds [13,14,15]. Similar trends have been witnessed in preserving cow breeds as part of a state’s cultural and genetic heritage [16].
Among various visual features, muzzle patterns have been widely used for cattle identification because of their distinct grooves and beaded prints [17]. For example, DL methodologies have been employed to extract distinguishable features from muzzle images [18,19]. In another work, an auto-encoder and a deep belief network were used to find hidden features of cow noses [20]. However, this approach ignored information about other essential body parts, such as the head and legs, resulting in reduced accuracy. Further, retinal features have been used to identify cattle, but the difficulty of capturing livestock retinal images limits their practical applicability [21]. Additionally, the hidden patterns of the body coat and face have been exploited to identify individual cattle [22]. Meanwhile, incorporating convolutional neural networks (CNNs) ensures the automatic extraction of rich features, resulting in improved identification of cattle breeds [23]. Later, beef cattle were detected in image sequences by fusing CNN and long short-term memory (LSTM) algorithms [24]. Another pioneering work automatically detected Holstein Friesian cattle by extracting coat pattern features [25,26], utilizing DL techniques (a CNN and a recurrent convolutional network, RCN) on ground and aerial-view images. Similarly, computer vision techniques such as you only look once (YOLO) and region-based CNN (RCNN) have been employed to detect cattle breeds using their morphological features [27,28]. However, these models were limited to detecting only one breed and lacked emphasis on breed diversity.
Table 1 summarizes previously reported DL-based cow breed detection models.
Based on the literature mentioned above, we identify the following research gaps:
Although the outcomes of the invasive techniques are promising, the experimental designs have certain flaws that make it difficult to evaluate the real significance of the reported results in harsh environments.
Cattle have previously been recognized based on the characteristics of specific body parts. However, other key body parts, such as the head and legs, were left out, which may result in the loss of crucial information.
In most of the literature, the emphasis is on identifying a single breed. However, these approaches cannot identify and classify various breeds.
1.3. Contributions
To our knowledge, no model in the animal biometrics literature automatically detects cow breeds while bridging the research gaps mentioned above. There is therefore a need for improved and advanced methods that detect cattle breeds based on overall body features. Thus, the present work presents the first proof-of-concept system for automatic cow breed detection. To summarize, the main contributions of this study are:
Development of a multi-breed cow detection framework based on YOLOv4 to identify and classify diverse cow breeds with high accuracy;
Development of a custom cow dataset containing multiple breeds using web-mining techniques;
Comparative performance analysis of the simulated results to identify the most effective training parameters.
1.4. Structure of the Paper
The remainder of the paper is organized as follows: Section 2 provides a theoretical overview of the DL algorithm (YOLOv4) used to extract cattle features from images. Section 3 illustrates the methodology, including dataset preparation and a brief analysis of the work. Section 4 reports the experimental results, including quantitative evaluations and a comparative analysis. Finally, Section 5 provides the conclusions and future directions.
2. Framework for Cow Breed Detection
Real-time vision-based applications not only require accuracy but also demand fast detection with the ability to recognize a wide variety of objects. Although traditional object detection algorithms (like RCNN, fast RCNN, and faster RCNN) provide accurate detection, they are slower [29]. Therefore, to increase detection speed, the single-shot detector (SSD) was introduced, which can detect multiple objects at a significant rate of 22–59 frames per second (FPS) [30]. However, it exhibits poor accuracy in the detection of small objects. Unlike conventional CNN architectures, YOLOv4 can be easily used in real-time applications due to its fast and accurate detection, making it well suited to object detection tasks. The algorithm is regression-based: instead of selecting regions of interest in an image, it predicts classes and bounding boxes for the whole image in one run. The parameters required to describe a bounding box are:
- Bounding box centre (bx and by);
- Width (bw);
- Height (bh);
- Class of the object (c) (such as Marchigiana, White Park, etc.).
Along with the above-mentioned parameters, YOLOv4 also predicts the probability (pc) that the bounding box contains an object, as illustrated in Figure 2.
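Taken together, each detection can be summarized as a prediction vector of the following form (an illustrative notation, not taken from the original figure):

$$\hat{y} = [\, p_c,\ b_x,\ b_y,\ b_w,\ b_h,\ c \,]$$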
The first edition of YOLO was YOLOv1 (with 24 convolutional layers), which was trained on the ImageNet-1000 dataset and can detect objects at 45 FPS [31]. It outperforms conventional detection methods (like DPM and R-CNN) in terms of accuracy and speed. However, it has difficulty detecting small objects, particularly when they appear in clusters. Therefore, another version, YOLOv2, was introduced, which significantly improved the performance of object detection models, offering the accuracy of faster R-CNN at the speed of SSD [32]. Due to the multi-scale training of the YOLOv2 network, it can detect and classify objects with different configurations and dimensions, and compared to its predecessor (i.e., YOLOv1) it detects smaller objects more accurately. To make object detection algorithms more accurate and faster, YOLOv3 was launched, which accurately classifies objects in real-time applications [33]. For multi-label classification, it uses logistic classifiers instead of SoftMax. In 2020, YOLO evolved into YOLOv4, which uses YOLOv3 as its head with some changes in the backbone and neck [34]. It gives remarkable results, with gains of 10% in accuracy and 12% in speed compared to YOLOv3. Therefore, in this study, YOLOv4 is used to detect cow breeds. The following subsections discuss the features and architecture of the YOLOv4 detection network.
2.1. YOLOv4
YOLOv4 has the edge over YOLOv3, as it implements a new backbone architecture, modifies the neck, and achieves a real-time speed of 65 FPS on a Tesla V100. In addition, there is no need for expensive GPUs for training; i.e., training can be done on a single conventional GPU with high accuracy. YOLOv4 integrates special features within the bag of freebies and bag of specials, as discussed below:
Bag of freebies (BoF): Accuracy is improved by changing the training strategy without increasing inference costs. To increase robustness to images obtained from distinct environments, it uses data augmentation, which increases the variability of the input images. Furthermore, it addresses photometric distortion by adjusting an image’s brightness, hue, saturation, contrast, and noise. For geometric distortion, input images are randomly scaled, cropped, flipped, and rotated. In addition to data augmentation, BoF also addresses object occlusion issues.
Bag of specials (BoS): It contains different post-processing modules that significantly enhance object detection accuracy at the cost of a slight rise in inference time.
Figure 3 illustrates various methods present in BoS.
2.2. YOLOv4 Architecture
As shown in Figure 4, the YOLOv4 architecture has three parent blocks: backbone, neck, and head (dense prediction).
Backbone: The CSPDarknet53 network is used as the backbone to extract essential features from the input image. The cross-stage-partial network (CSPNet) divides the feature map of the base layer into two segments, as illustrated in Figure 5. A dense block contains multiple convolution layers that take the output of all the preceding layers and merge it with the current layer. DenseNet contains multiple dense blocks connected by transition layers (comprising convolution and pooling layers).
Neck: The neck’s main contribution to detection is combining feature maps from different stages. It enhances the information gathered from the backbone layer and feeds it into the head. It concatenates semantic-rich information (from the feature map of the top-down stream) with the spatial-rich information (from the bottom-up stream’s feature map) and feeds the concatenated output into the head.
Head: To perform dense prediction, YOLOv3 serves as the head of the YOLOv4 architecture. As a result, it provides the final prediction along with a vector containing the predicted coordinates of the bounding box and the associated confidence score with a class label.
3. Methodology
Figure 6 illustrates the flow chart of the training process using the YOLOv4 algorithm. The first task in any DL workflow is to prepare the dataset. For this purpose, 1835 images (covering eight cow breeds) were collected using web-mining techniques. Because DL models are data-driven, data augmentation was performed on the acquired images to reduce the risk of overfitting. As shown in Figure 7, data augmentation involves a group of techniques that increase the size of the training dataset; an instance of augmented images is shown in Figure 8.
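As a rough illustration of such an augmentation stage (a sketch using torchvision, not the exact pipeline used in this work), the photometric and geometric transformations described above could be composed as follows:

```python
# Illustrative augmentation pipeline (assumed; not the exact one used in this work).
# Combines photometric jitter with geometric flips, rotations, and scaled crops.
from torchvision import transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.05),         # photometric distortion
    transforms.RandomHorizontalFlip(p=0.5),                   # geometric: flip
    transforms.RandomRotation(degrees=15),                    # geometric: rotation
    transforms.RandomResizedCrop(size=416, scale=(0.8, 1.0)), # geometric: scale/crop
])
# augmented = augment(pil_image)  # applied per training image
```

Note that for detection tasks, the corresponding bounding-box annotations must be transformed alongside the images; the sketch above shows only the image side.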
Further, the present work employs a transfer learning approach, because a large dataset would otherwise be required to train the model from scratch. Therefore, pre-trained weights (yolov4.conv.137) are applied as initial weights at the beginning of training. Moreover, the developed custom dataset of 1835 images was randomly partitioned into three subsets: (1) a training set of 1662 images (about 90%) used to train the proposed model, (2) a validation set of 141 images used to validate the model, and (3) a test set comprising the remaining 32 images. The division of the dataset for each breed is illustrated in Table 2.
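A minimal sketch of such a random split (illustrative only; the seed and file naming are assumptions):

```python
# Illustrative random train/validation/test split matching the counts above.
import random

random.seed(42)                                  # assumed seed, for reproducibility
images = [f"img_{i}.jpg" for i in range(1835)]   # stand-in for the 1835 image paths
random.shuffle(images)

train, val, test = images[:1662], images[1662:1803], images[1803:]
print(len(train), len(val), len(test))  # -> 1662 141 32
```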
As discussed earlier, the network is trained with YOLOv4 to detect eight cow breeds (Afrikaner, Brown Swiss, Gyr, Holstein Friesian, Limousin, Marchigiana, White Park, and Simmental). Moreover, the performance of the YOLOv4 algorithm is evaluated by training the model on different sets of training parameters. The parametric settings used to train the model are tabulated in Table 3. The whole investigation was performed on an Nvidia RTX 2060 GPU, and the scripts were compiled with Visual Studio 2017.
3.1. Evaluation Metrics
During training, the intersection over union (IoU) is calculated by matching the detected bounding box with the ground-truth box. It can be determined via Equation (1) [35].
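In its standard form, Equation (1) is the ratio of the overlap area to the union area of the predicted and ground-truth boxes:

$$\mathrm{IoU} = \frac{\mathrm{area}(B_{pred} \cap B_{gt})}{\mathrm{area}(B_{pred} \cup B_{gt})}$$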
Figure 9 illustrates an example of IoU computation. In this example, the IoU_threshold has been taken as 0.5: if the prediction exceeds this threshold, it is classified as a true positive; otherwise, the detection is designated a false positive. Thus, by changing the IoU_threshold, the model will yield different true or false positives for the same prediction. The results are validated by computing the performance metrics described in the following subsections.
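As a concrete sketch of this IoU-thresholding step (illustrative Python; the corner-format box representation [x1, y1, x2, y2] is an assumption):

```python
# Minimal IoU sketch; boxes are assumed to be in [x1, y1, x2, y2] corner format.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)            # overlap area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)             # union = A + B - overlap

# A detection counts as a true positive only above the chosen IoU_threshold.
pred, truth = [50, 50, 150, 150], [60, 60, 160, 160]
print("TP" if iou(pred, truth) >= 0.5 else "FP")         # IoU ~ 0.68 -> TP
```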
3.1.1. Precision–Recall (PR) Curve
PR curves summarize the trade-off between precision and recall. They are plotted at different probability thresholds, with precision along the y-axis and recall along the x-axis.
Precision: The detected bounding box is compared with the ground-truth box, and precision describes how good the model is at predicting the positive class. It is represented by Equation (2) [12], where NTP is the number of predictions that match the ground-truth boxes (true positives) and NFP is the number of false detections (false positives).
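In its standard form, Equation (2) is:

$$Precision = \frac{N_{TP}}{N_{TP} + N_{FP}}$$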
Recall: Recall denotes sensitivity, i.e., how many of the ground-truth boxes are captured as positive predictions. It is generally expressed by Equation (3) [36], where NFN is the number of ground-truth objects that could not be detected (false negatives).
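Equation (3) then takes the standard form:

$$Recall = \frac{N_{TP}}{N_{TP} + N_{FN}}$$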
A model with perfect skill is depicted as a point at (1, 1), where both precision and recall are high; the accuracy of the model therefore increases as its curve moves toward (1, 1). The area under the PR curve is known as the average precision (AP), and the mean of the APs over all classes is termed the mean average precision (mAP). These are represented by Equations (4) and (5), respectively [37].
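In their standard forms, with p(r) denoting precision as a function of recall and n the number of classes:

$$AP = \int_{0}^{1} p(r)\, dr, \qquad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i$$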
Precision and recall are encapsulated in another well-known evaluation metric, the F1 score. It is the harmonic mean of precision and recall and is computed by Equation (6) [38].
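In its standard form:

$$F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$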
3.1.2. Confusion Matrix
Classification accuracy is determined by the ratio of true positives to the total number of predictions made. It can be misleading if the data contain more than two classes or are not balanced. For example, a classification accuracy of 90% does not mean that all classes are being predicted equally well; the model may neglect one or two categories. Good accuracy can be achieved simply by predicting the most common class, i.e., the class with the maximum number of training images. Therefore, to visualize the model’s performance, a confusion matrix is employed. It summarizes prediction results, with the numbers of correct and incorrect predictions encapsulated class-wise in a matrix, as shown in Figure 10 [39].
A confusion matrix is not limited to true/false positives; it also helps estimate performance through other evaluation metrics, including accuracy and kappa.
Overall accuracy (OA): Perfect classification is represented by 100% accuracy, where all classes are classified correctly. It is calculated by Equation (7):
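In standard form, with m_ii the diagonal entries of the confusion matrix, N the total number of classified values, and n the number of classes:

$$OA = \frac{\sum_{i=1}^{n} m_{ii}}{N}$$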
The diagonal elements of the matrix represent the correctly classified predictions, and the total number of values represents the total objects detected. OA is the easiest metric to read from the confusion matrix; however, it provides only basic accuracy information, as class-wise evaluation is missing. Therefore, class-wise precision, recall, and Cohen’s kappa are also calculated from the matrix, as discussed below:
Precision and Recall: The confusion matrix helps in estimating precision and recall for each class. The precision and recall values are calculated using Equations (2) and (3).
Cohen’s Kappa: Another metric calculated from the confusion matrix is kappa. It measures the agreement between the classification and the truth values, i.e., it calculates inter-rater reliability. It is generally a more robust measure than simple accuracy, as kappa accounts for the possibility of agreement occurring by chance. Mathematically, it is computed by Equation (8) [40].
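A standard form of Equation (8), consistent with the variable definitions below, is:

$$\kappa = \frac{N \sum_{i=1}^{n} m_{ii} - \sum_{i=1}^{n} C_i\, G_i}{N^{2} - \sum_{i=1}^{n} C_i\, G_i}$$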
here,
n: total number of classes;
N: total number of classified values compared to truth values;
m_ii: values found along the diagonal of the confusion matrix;
C_i: total number of predictions belonging to class i;
G_i: total number of truth values belonging to class i.
Kappa varies in the range (0, 1): the lower bound indicates no agreement, whereas the upper bound indicates perfect agreement. Practically, Cohen’s kappa removes the possibility of agreement arising from random guessing.
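As a minimal sketch of how OA and kappa follow from a confusion matrix (the matrix values below are invented for illustration and are not results from this study):

```python
# Illustrative OA and Cohen's kappa from a confusion matrix
# (rows = truth, columns = prediction; values are made up for illustration).
import numpy as np

cm = np.array([[8, 1, 0],
               [1, 7, 1],
               [0, 2, 6]])

N = cm.sum()                    # total classified values
oa = np.trace(cm) / N           # overall accuracy, Equation (7)

C = cm.sum(axis=0)              # predictions per class (C_i)
G = cm.sum(axis=1)              # truth values per class (G_i)
kappa = (N * np.trace(cm) - (C * G).sum()) / (N**2 - (C * G).sum())  # Equation (8)

print(f"OA = {oa:.3f}, kappa = {kappa:.3f}")
```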
4. Experimental Results and Discussion
Training any DL model from scratch requires substantial data and computational resources. Therefore, a transfer learning approach is employed to develop the proposed cow breed detection model using pre-trained YOLOv4 weights. This helps the model learn better attributes, resulting in improved detection capability. Further, the model is trained for 20,000 iterations with the hyperparameters listed in Table 3. The training graph of the model with the developed custom dataset is mapped in Figure 11.
The corresponding performance parameters are provided in Table 4. Training time increases as the image size (i.e., width × height) increases, indicating that YOLOv4 runs faster on smaller images. From the graph in Figure 11, it can be seen that the mAP value increases gradually when the IoU_threshold is 0.75 compared to 0.50. By changing the IoU_threshold, the model yields different true or false positives for the same prediction, consequently affecting the mAP values.
Figure 12 shows the statistical analysis of performance metrics obtained from the last epoch.
Table 4 and Figure 12 show that recall is higher (by almost 2%) when the IoU_threshold is low, as more predictions turn positive at a low threshold. However, mAP increases by 0.5% for an IoU_threshold of 0.75 compared to 0.5. A probable reason is the decrease in false positives as the threshold increases.
The dataset contains only 32 test images, each of which generally shows only one breed. Therefore, to explore the efficiency of the model under more complex circumstances, a collage containing cows from multiple breeds was created. The detection results on sample test images, along with the collage, are shown in Figure 13, and a comparative analysis is given in Table 5. It is observed that the model gives higher precision (>2%) as image resolution increases. It is also noticed that detection time increases by 65% with the larger image size, as YOLOv4 runs more slowly on large images.
4.1. PR Curve
Figure 14 shows the PR curves for the different combinations of network size and IoU_threshold. The curves are closer to (1, 1) for each class in the fourth case (608 × 608, 0.75) than in the others. Furthermore, the mAP calculated from the PR curves is greater for ‘608 × 608, 0.75’ by 0.17%, 0.15%, and 0.01% compared to ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively. This indicates that the PR curve changes visibly as the image resolution changes.
4.2. Confusion Matrix
As discussed earlier, the image resolution and IoU_threshold significantly influence the performance of the object detection model. This is also reflected in the confusion matrices illustrated in Figure 15. In this study, predictions with a confidence score greater than 1% were taken as positive when computing the confusion matrices. Since cows of the studied classes share very similar traits, it is difficult for the model to extract discriminative features at a lower threshold and image resolution; hence, a large number of misidentifications and misclassifications is witnessed, particularly with ‘416 × 416, 0.50’. However, finer features can be recognized as image resolution improves, reducing the frequency of erroneous detections.
The OA values calculated from the confusion matrices are illustrated in Figure 16. OA is greater in the fourth case (‘608 × 608, 0.75’) by 33.27%, 3.28%, and 19.70% than in ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively. This also validates the hypothesis that the performance of the model improves as the image resolution and threshold increase.
To further validate the model, class-wise precision and recall values are computed; the results are presented in Figure 17 and Figure 18, respectively. The precision and recall values are lower for an IoU_threshold of 0.50 than for 0.75, as the total number of objects detected is lower in the latter case (reflected in the confusion matrices in Figure 15).
Due to the diversity of training image sizes, precision and recall do not follow a consistent trend when images are resized from 416 × 416 to 608 × 608. Model performance is affected because the model uses zero padding to fit the input image to the required network size.
Figure 19 also supports the observation that detection accuracy increases when both the image resolution and the threshold increase. Kappa is greater in the fourth case (608 × 608, 0.75), with gains of 42.10%, 3.84%, and 23.60% compared to ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively.
4.3. Comparison with State-of-the-Art Models
The results above demonstrate the improved detection accuracy of the proposed model with a 608 × 608 image resolution and an IoU_threshold of 0.75. Further, most of the reported work detects only one breed. Therefore, for a fair comparison, the developed custom dataset was used to train three other popular detection techniques (faster RCNN, SSD, and YOLOv3). Table 6 presents the class-wise accuracy of the proposed model relative to faster RCNN, SSD, and YOLOv3, and also compares the detection speeds of these models.
The YOLOv4 model outperforms faster RCNN by margins ranging from a minimum of 0.45% for the Limousin breed to a maximum of 11.60% for the Gyr class. Similarly, YOLOv4 exceeds the detection accuracy of SSD by a margin of at least 9.00% across all categories. Similar trends were witnessed when comparing YOLOv4 with YOLOv3, where the smallest boost (3.11%) was observed for White Park. Moreover, the YOLOv4 model also significantly improved inference speed.
To conclude, the experiments and results reported in this section validate the credibility of this work and place the proposed model among the top-ranked cow breed detection models. The methodology proposed in this work could be used in real-time scenarios to detect cow breeds, thereby assisting in the improvement of automatic livestock farming.
5. Conclusions
This work proposes a vision-based model to recognize the breed of a cow. YOLOv4, a DL algorithm, is applied to learn discriminative features of cows from a limited training dataset. For this, a custom dataset covering eight cow breeds (Afrikaner, Brown Swiss, Gyr, Holstein Friesian, Limousin, Marchigiana, White Park, and Simmental) was generated. To test the efficiency of the algorithm, PR curves and confusion matrices were computed on test images, which demonstrate that the YOLOv4 algorithm works best with an image size of 608 × 608 and an IoU_threshold of 0.75. The mAP calculated from the PR curves for this setting improves by 0.17%, 0.15%, and 0.01% over ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively; consequently, the PR curve shows visible changes with variations in image resolution. The overall accuracy (OA) calculated from the confusion matrix is higher for ‘608 × 608, 0.75’ by 33.27%, 3.28%, and 19.70% than for ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively. Another metric, kappa, also indicates that the model performs best when the image size and IoU_threshold are ‘608 × 608, 0.75’, increasing by 42.10%, 3.84%, and 23.60% relative to ‘416 × 416, 0.50’, ‘416 × 416, 0.75’, and ‘608 × 608, 0.50’, respectively. Overall, the experimental results demonstrate that model accuracy can be improved by training YOLOv4 on higher-resolution images with a greater IoU_threshold. Further, the developed cow breed model is compared with models built with faster RCNN, SSD, and YOLOv3; this comparative analysis validates the improved performance of the YOLOv4-based cow breed detection model.
Further research will focus on video tracking for effective identification via surveillance. As the present work on the individual identification of cow breeds (through images) yields highly accurate results, it would be interesting to incorporate simple tracking techniques between video frames to test the efficiency of this work, particularly in cases of heavy bunching of cows. Further, the proposed methodology needs to be trained on aerial-view-based datasets with multiple breeds to enhance the accuracy and robustness of the model; this will be addressed in future work. In addition, the scalability of our approach to large populations remains to be tested. This will open new doors for deploying vision-based algorithms in the precision livestock farming sector.