Underwater image segmentation in the wild using deep learning

Image segmentation is an important step in many computer vision and image processing algorithms. It is often adopted in tasks such as object detection, classification, and tracking. The segmentation of underwater images is a challenging problem because the water and the particles suspended in it scatter and absorb light rays. These effects make the application of traditional segmentation methods cumbersome. Moreover, applying state-of-the-art segmentation methods, which are based on deep learning, requires an underwater image segmentation dataset. In this paper, we therefore develop a dataset of real underwater images, along with further combinations that use simulated data, to allow the training of two of the best deep learning segmentation architectures, aiming to deal with the segmentation of underwater images in the wild. In addition to models trained on these datasets, fine-tuning and image restoration strategies are also explored. For a more meaningful evaluation, all models are compared on the testing set of real underwater images. We show that the methods obtain impressive results when compared with manually segmented ground truth, mainly when trained with our real dataset, even with a relatively small number of labeled underwater training images.


Introduction
The segmentation of underwater images has many applications in areas such as subsea inspection and biological research. Even a simple background subtraction, if highly accurate, can be an important part of more complex tasks, such as animal counting [1,2], image restoration, and robot obstacle avoidance [3,4]. For that purpose, the segmentation method must be able to segment underwater images in the wild, and not only in a controlled environment. As the main example, a technique that simply divides the input image into two classes, background and foreground, provides valuable information for an algorithm responsible for underwater robot obstacle avoidance, since it indicates the regions of the image that may contain objects to collide with. However, underwater images exhibit particular characteristics that make them more difficult to handle, including blurriness, reduced contrast, and distorted colors [5,6]. Because of this, standard segmentation algorithms cannot be directly applied to underwater images. Thus, the purpose of this paper is to explore two state-of-the-art deep learning segmentation architectures, together with restoration and fine-tuning techniques, using underwater segmentation datasets that are also made available through this paper.
Convolutional neural networks (CNNs) are the current state of the art in image segmentation [7]. Thus, an evident solution to the problem of underwater image segmentation is to adapt state-of-the-art segmentation architectures to deal with underwater images. However, deep CNNs generally require thousands of sample images to be properly trained. As the manual segmentation of images is a labor-intensive task, building a dataset as large as those usually used in other deep learning problems would take a considerable amount of time and resources. There are a few possible strategies to overcome this problem.
The first one is to pre-train the network on the segmentation of non-underwater images and then fine tune it using a smaller dataset of manually labeled real underwater images. This approach is known as transfer learning [8]. Another approach is to use simulated data, which allows a larger number of training samples, but results in a less realistic dataset. Finally, we can try to pre-process the input to remove the effects of underwater degradation before segmenting it with a CNN trained with non-underwater images. We evaluated all these strategies in state-of-the-art image segmentation CNN architectures.
In this paper, we propose four datasets, mainly one composed of real, manually annotated underwater images, to train deep CNN architectures for the task of underwater image segmentation. Furthermore, we present several deep learning solutions based on two state-of-the-art segmentation architectures, using different pre-processing and pre-training steps. All the setups are trained on each developed dataset, and the resulting models are then evaluated using the ground truth of the real testing set as reference.
To the best of our knowledge, this is the first work to use a CNN approach for underwater image segmentation in the wild. The main contribution, however, which is what allows the use of CNNs, is our dataset of real underwater images in the wild and their respective ground truths, which is made publicly available. We hope this dataset helps researchers to evaluate and improve underwater segmentation methods.
The remainder of the paper is organized as follows: the "Related work" section shows the related works in the areas of underwater image segmentation and image segmentation using CNNs; the "Methodology" section presents the proposed methodology; the "Experimental results" section evaluates the obtained results. Finally, we summarize the paper contributions and draw the future research directions in the "Conclusion" section.

Deep learning-based segmentation
In recent years, convolutional neural networks have become the state of the art in the area of image segmentation, including high-level semantic segmentation. In [9], a texture segmentation and classification method based on features extracted by image classification CNNs trained on the ImageNet ILSVRC [10] dataset is proposed, achieving state-of-the-art performance on several datasets. In [11], end-to-end pixel-wise semantic segmentation is performed using fully convolutional networks. The main advantage of these models over standard CNNs is that they lack fully connected layers, allowing them to operate on inputs of variable size without the need to modify the network's architecture. Subsequently, Mask R-CNN [12], an extension of the Faster R-CNN [13] object detection architecture, achieved state-of-the-art instance segmentation results. The current state of the art in the PASCAL VOC semantic segmentation challenge is the DeepLab neural architecture [14], whose newest version, DeepLabv3+ [15], outperforms other networks in the semantic segmentation task. The success of these architectures in their respective segmentation tasks leads us to believe that CNNs are the most promising approach to achieve good underwater segmentation results.

Underwater image segmentation
Several approaches have been applied to the problem of underwater image segmentation. In [16], an underwater image segmentation technique is presented that uses CLAHE histogram equalization followed by histogram thresholding. In [17], underwater segmentation is performed by measuring the Mahalanobis distance between each pixel and the background color estimated from sample background images. In [18], Particle Swarm Optimization (PSO) is used to maximize the entropy for underwater image segmentation. The same technique is adopted by [19] and [20], but using C-means to cluster the pixels. In [21], the underwater images are filtered with a median filter and segmented with K-Means clustering; image features are then extracted using HOG and used to classify the segments with an SVM classifier. Using a similar strategy, a more recent solution [22] improves the selection of the initial K-Means centroids, which leads to better results at an increased computational cost. Another recent solution [23] uses an active contour strategy, minimizing an energy function to obtain the segmentation mask of the object in the underwater image. In [24], a deep learning technique is adopted, in which a fully convolutional network performs frame-by-frame fish segmentation in underwater videos. The authors use a weakly labeled dataset of videos whose ground truth is derived from a motion-based background subtraction (BGS) technique [25] rather than from manual annotations. They evaluate the precision and recall of their model in fish detection, but not the quality of the segmentation masks on a per-pixel basis. In [26], a candidate object region is extracted from the image based on the presence of artificial light estimated from optical features. The region is segmented using parametric kernel graph cuts [27]. The main drawback of this method is that it relies on the presence of artificial light in the image and therefore will not work properly in situations where the only source of illumination is natural light.
While these methods are able to segment images in certain situations, they still rely on heuristics or weakly labeled data. Inspired by the success of deep learning architectures in several difficult computer vision tasks, we aim to develop a more general solution, based on a small but reliable set of manually labeled training images.

Methodology
The most straightforward approach to obtain a powerful underwater image segmentation solution is to train a state-of-the-art segmentation CNN architecture with underwater images. The main obstacle to this approach, however, is that, to the best of our knowledge, no adequate public underwater segmentation dataset exists. The manual segmentation of underwater images is a relatively simple, but labor-intensive process. Therefore, it would be extremely impractical to create a dataset as large as those usually used to train deep CNN architectures from scratch, as such models generally require thousands of samples to be properly trained.
We can circumvent this problem by pre-training the network using a large semantic segmentation dataset of non-underwater images and then fine-tuning it using a much smaller dataset composed of manually segmented underwater images. The idea is that the low-level features learned during the initial training help the network in the segmentation of underwater images. Thus, datasets with a relatively small number of images can be proposed, both to train deep learning segmentation models and to serve as a benchmark for comparison.
In the next sections, we present the proposed datasets created using both real and simulated images in the wild. Furthermore, we introduce the adopted neural architectures and the training process.

Datasets
There are some datasets for specific underwater tasks, e.g., fish detection and classification. However, these datasets are not compatible with our problem of underwater image segmentation in the wild, since they are focused on fishing. Thus, we created our own datasets.
NAUTEC UWI Real: Our real underwater dataset is composed of 700 underwater images in the wild collected from the Internet. The images were manually segmented into foreground and background pixels. We randomly select 400 images for training and 300 for testing. Three sample images from the dataset and their respective ground truths can be seen in Fig. 1. The dataset contains images acquired under several water conditions, illumination conditions, and places, including images from both benthic and pelagic zones, without differentiating one from the other. There are both naturally and artificially lit images. Furthermore, divers, marine life, and many underwater objects are present in these images acquired in the wild. This dataset is available as additional material of this work.
Manually segmenting underwater images is a labor-intensive, time-consuming task. Because of this, our real underwater dataset is relatively small. Simulated images can increase the amount of training data. They can be created by simulating the effects of underwater degradation on non-underwater images whose segmentation labels are available. These effects can be created according to the Jaffe-McGlamery optical model [29,30], as adopted in [31]. We adopted a simplified version of the model in which forward scattering is neglected, since backscattering is mainly responsible for the image degradation [32]. We also use a set of real underwater image patches taken from backscattering areas, which provide the medium parameters. These simulated effects are similar to those presented by Duarte et al. [33]. However, the model requires the availability of the image's depth map. While we believe outdoor scenes would be more adequate, as they are closer to subsea images, we are forced to base our simulation on indoor images: to the best of our knowledge, there are no publicly available outdoor datasets with high-quality depth maps. Despite the obvious differences between indoor and underwater scenes, the network is expected to learn, from these data, how to segment objects obscured by underwater degradation in a more general way. In particular, the simulated dataset allows the network to learn the attenuation effect generated by light traveling through water [31].
We use NYU Depth V2 [34] as the basis of our simulated data. This dataset provides images with segmentation labels and high-quality depth maps. We modified the original segmentation labels by considering pixels labeled as wall, roof, floor, etc., as background and pixels labeled as objects as foreground. We create three additional datasets using these data:
NAUTEC UWI Sim200: This dataset is composed of 200 simulated images with relatively low turbidity. A simulated image from this dataset is shown on the left of Fig. 2.
NAUTEC UWI Mixed: This dataset is the union of the Sim200 and Real Underwater datasets.
NAUTEC UWI Sim1000: This dataset is composed of 1000 simulated underwater images with four additional levels of increasing simulated underwater turbidity in relation to the Sim200. An image from this dataset with its five levels of turbidity is shown in Fig. 2.
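As a rough illustration of how simulated images of this kind might be produced, the sketch below applies a commonly used simplified image-formation model in which forward scattering is neglected: each pixel is the attenuated in-air color plus a backscatter term that grows with distance. The function name `simulate_underwater`, the per-channel attenuation coefficients, and the veiling-light color are illustrative assumptions; they are not the parameters used to build the NAUTEC UWI datasets.

```python
import numpy as np


def simulate_underwater(image, depth, attenuation, veiling_light):
    """Apply a simplified underwater degradation (no forward scattering).

    image:         H x W x 3 float array in [0, 1] (in-air image)
    depth:         H x W float array, distance to the camera in metres
    attenuation:   3 per-channel attenuation coefficients in 1/m (assumed values)
    veiling_light: 3-element water color in [0, 1] (assumed values)
    """
    image = np.asarray(image, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)[..., np.newaxis]
    c = np.asarray(attenuation, dtype=np.float64)
    b = np.asarray(veiling_light, dtype=np.float64)

    # Fraction of the scene light that survives the path to the camera.
    transmission = np.exp(-c * depth)
    # Attenuated direct component plus distance-dependent backscatter.
    direct = image * transmission
    backscatter = b * (1.0 - transmission)
    return np.clip(direct + backscatter, 0.0, 1.0)
```

Under this sketch, scaling the attenuation coefficients upward is one plausible way to mimic higher turbidity levels; how the turbidity levels of Sim1000 were actually parameterized is not specified here.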

Network architectures
Designing a deep learning architecture from scratch is an arduous and time-consuming task. Thus, we evaluated two well-known semantic segmentation architectures in this work: SegNet [7] and DeepLabv3+ [15]. The evaluated datasets for inland images have been previously described. We have used 10% of the training dataset for validation.
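The data partition described in the text (400 training and 300 testing images drawn at random from the 700 real images, with 10% of the training set held out for validation) can be sketched as below. The function name `split_dataset` and the fixed random seed are assumptions made for reproducibility of the example, not details taken from the original experiments.

```python
import random


def split_dataset(image_ids, n_train=400, val_fraction=0.1, seed=0):
    """Randomly split ids into train / validation / test subsets."""
    ids = list(image_ids)
    rng = random.Random(seed)  # seed is an assumption, used only for repeatability
    rng.shuffle(ids)

    # The first n_train shuffled ids form the training pool, the rest the test set.
    train_pool, test = ids[:n_train], ids[n_train:]
    # Hold out a fraction of the training pool for validation.
    n_val = int(round(len(train_pool) * val_fraction))
    val, train = train_pool[:n_val], train_pool[n_val:]
    return train, val, test


train, val, test = split_dataset(range(700))
print(len(train), len(val), len(test))  # 360 40 300
```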

SegNet
SegNet is a fully convolutional encoder-decoder semantic segmentation architecture, as shown in Fig. 3. Its encoder network is topologically identical to the 13 convolutional layers of the VGG16 [35] image classification network. The main advantage of SegNet over competing segmentation architectures is the reduction in memory use provided by its decoder network architecture. We choose to evaluate SegNet because it is a classical image segmentation architecture based on CNNs. We run 50,000 training iterations with a batch size of 5 using the Adam optimization algorithm [36] with a learning rate of $1.5 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The network is initialized using random weights. Furthermore, we also pre-trained the weights using the PASCAL VOC dataset [37]. As our objective is to segment underwater images into foreground and background, we ignore the class information and simply consider pixels labeled as any of the 20 PASCAL VOC classes to be foreground. We use 10,582 images for training and 300 for validation in the pre-training step. In this case, we resize the inputs to 480 × 360 pixels and train until convergence. The network parameters are also trained using underwater images: we train the network on the previously described underwater datasets for 50,000 iterations using the same hyperparameters.
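The collapse of the 20 PASCAL VOC classes into a single foreground class can be sketched as follows. In the standard PASCAL VOC label maps, 0 denotes background, 1 to 20 denote object classes, and 255 marks void or boundary pixels. Keeping the void pixels as an ignore label is an assumption of this sketch (the text does not say how they were handled), and the function name `voc_to_binary` is hypothetical.

```python
import numpy as np

VOID_LABEL = 255  # PASCAL VOC value for void / boundary pixels


def voc_to_binary(label_map, void_label=VOID_LABEL):
    """Map a PASCAL VOC label map to background (0) / foreground (1).

    Void pixels keep their original value so that a training loss can
    ignore them (an assumption of this sketch).
    """
    label_map = np.asarray(label_map)
    # Any object class (1..20) becomes foreground.
    binary = (label_map > 0).astype(np.uint8)
    # Restore the void label, which the comparison above also set to 1.
    binary[label_map == void_label] = void_label
    return binary
```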

DeepLabv3+
Like most deep learning segmentation architectures, DeepLabv3+ is an encoder-decoder network; moreover, it is fully convolutional and, currently, one of the top-ranked architectures in the PASCAL VOC segmentation challenge. The main idea behind the architecture is to preserve spatial information by reducing the number of strided pooling operations. Atrous convolutions then compensate for the resulting reduction in the receptive field. Another adopted technique is the detection of objects at multiple scales by parallel atrous convolutions at different sampling rates. All these characteristics can be seen in the DeepLabv3+ architecture shown in Fig. 4. We choose the DeepLab architecture because it is the state of the art in semantic segmentation.
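To make the role of atrous convolutions more concrete, the sketch below shows how the sampling rate enlarges the span covered by a kernel without adding weights, using a plain 1-D example. It is a didactic illustration only, not the DeepLabv3+ implementation; the function names and the rates used in the example are chosen here for illustration.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a kernel whose taps are spaced `rate` samples apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D atrous (dilated) correlation in pure Python."""
    span = effective_kernel_size(len(kernel), rate)
    output = []
    for start in range(len(signal) - span + 1):
        # Each kernel tap reads the input `rate` samples after the previous tap.
        output.append(sum(w * signal[start + i * rate] for i, w in enumerate(kernel)))
    return output


# A 3-tap kernel covers 3, 5, and 13 input samples at rates 1, 2, and 6.
print([effective_kernel_size(3, r) for r in (1, 2, 6)])  # [3, 5, 13]
print(atrous_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], rate=2))  # [-4, -4, -4]
```

Running several such kernels in parallel at different rates is what lets the network look at the same location at multiple scales.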
Training a large architecture such as DeepLab from scratch is difficult, especially when the amount of available data is limited. Unlike SegNet, which is initialized using random weights, we start the training by initializing the model with Xception [38] backbone weights. We also pre-trained the model on the PASCAL VOC dataset for 20,000 iterations with a batch size of 8, randomly cropping the inputs to a size of 513 × 513. We employ common data augmentation methods, such as input scaling and mirroring. We use SGD with a momentum of 0.9 and polynomial learning rate decay with a base learning rate of $10^{-4}$ and power = 0.9. Weight decay is set to $4 \times 10^{-5}$. Finally, we train the model using the previously described underwater datasets for 20,000 iterations using the same training setup.
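The polynomial learning rate decay mentioned above is commonly written as lr = base_lr · (1 − step / max_steps)^power. The sketch below assumes this usual form and plugs in the values given in the text (base learning rate $10^{-4}$, power 0.9, 20,000 iterations); the function name `poly_learning_rate` is hypothetical.

```python
def poly_learning_rate(step, base_lr=1e-4, max_steps=20000, power=0.9):
    """Polynomial ("poly") learning rate decay, assuming the common form."""
    # Clamp so that steps outside [0, max_steps] do not produce invalid values.
    step = min(max(step, 0), max_steps)
    return base_lr * (1.0 - step / max_steps) ** power


for step in (0, 10000, 20000):
    print(step, poly_learning_rate(step))
```

With these settings the rate starts at the base value, is a little over half of it at the midpoint, and reaches zero at the final iteration.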

Experimental results
We evaluate our models on the remaining 300 randomly selected images from the real underwater dataset that were not presented in the training step. The results are evaluated using the standard mean Intersection over Union (mIoU) metric. We also take the raw network output, with no additional post-processing.
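For reference, a minimal sketch of the mIoU metric for the two-class (background/foreground) case is given below. Whether the score is averaged per image or accumulated over the whole test set is not stated in the text; the sketch computes it for a single pair of masks, and the function name `mean_iou` is hypothetical.

```python
import numpy as np


def mean_iou(prediction, ground_truth, num_classes=2):
    """Mean Intersection over Union between two integer label masks."""
    prediction = np.asarray(prediction)
    ground_truth = np.asarray(ground_truth)
    ious = []
    for cls in range(num_classes):
        pred_c = prediction == cls
        gt_c = ground_truth == cls
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both masks: skip it
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    if not ious:
        raise ValueError("no class from range(num_classes) is present in the masks")
    return float(np.mean(ious))
```

For example, with prediction `[[1, 1], [0, 0]]` and ground truth `[[1, 0], [0, 0]]`, the background IoU is 2/3 and the foreground IoU is 1/2, giving an mIoU of 7/12.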
For the sake of a fair comparison, we also evaluate the networks using state-of-the-art underwater image restoration algorithms as a pre-processing step. The idea is to reduce the effects of the water that make segmentation difficult. We applied the Underwater Dark Channel Prior (UDCP) method [39] and the Underwater GAN (UGAN) [40] before segmenting the image with the models trained only on the PASCAL VOC dataset. Although pre-processing with restoration methods sounds like a promising idea, mainly with UGAN, the results are not competitive.
We do not present a comparison with other underwater segmentation methods. All of them use classical methodologies that are unable to segment the underwater images properly. We found the results of these methods to be not even remotely competitive with those obtained using deep neural networks. Table 1 shows the mIoU accuracy for all evaluated models. Figures 5, 6, and 7 show qualitative underwater segmentation results.
The results show that a CNN approach is viable for the task of underwater image segmentation, even with a limited number of training images. The best result is achieved using the DeepLab architecture trained on our NAUTEC UWI Real underwater dataset. The segmentation performance is slightly reduced when the real dataset is augmented with simulated images. Models trained only with simulated data could not produce satisfactory results, but are still preferable to no fine-tuning at all.
The main reason is that our simulated images are based on indoor scenes, which are distinct from the real images of the testing set. We rely on indoor scenes because the simulation requires depth maps, which can only be properly obtained in indoor environments, as described in the "Datasets" section. However, models trained with simulated data perform better than the ones trained with real data on images where the background is the sea floor rather than pure water, such as the sample image shown in Fig. 6. We believe the simulated dataset presents a diverse background structure that is similar to the one in this testing image. Thus, the networks trained with this dataset are able to segment this type of test image better than the networks trained only on real data.
[Figure caption: Qualitative results obtained using another sample image from the NAUTEC UWI Real underwater dataset on our networks with different training data]
The results of the pre-trained DeepLabv3+ are better than those of the pre-trained SegNet. However, without pre-training, the SegNet results are better. We believe the main reason for this is the larger size and complexity of the DeepLab architecture. Larger architectures are generally more prone to overfitting, which implies inferior generalization performance, especially when the amount of training data is limited. Despite this, the SegNet adopted in this work lacks the size and complexity needed to achieve competitive performance, even given the relatively small number of training samples. In addition, a more complex network normally has a higher computational cost, which is also true in this case. When running the validation process on a computer with an Intel Core i7-7700K, 16 GB of RAM, and an NVIDIA GeForce GTX Titan X 12 GB, DeepLabv3+ achieves a mean of 10.06 FPS, while SegNet achieves 16.54 FPS.
Surprisingly, pre-processing with UDCP [39] harmed the segmentation performance of both architectures. However, UGAN [40] improved the performance of the DeepLab segmentation. We believe this happened because UDCP tends to produce artifacts that may confuse the network, leading to inaccurate segmentation. UGAN, in contrast, as a deep learning technique, achieves better restoration results without producing too many artifacts that can confuse the network.

Conclusion
In this paper, we presented a set of datasets to train deep CNN architectures for the task of underwater image segmentation in the wild. We evaluated the impact of pre-training and simulated training data on the network performance. We also presented a working solution based on the DeepLabv3+ image segmentation architecture, achieving an mIoU accuracy of ≈ 91.9% on a random test set of 300 real underwater images. We showed that this network architecture is able to segment properly even with a small number of training images. Qualitative evaluation leads us to believe that our results are superior to those of traditional underwater segmentation methods. Another important contribution is our publicly available dataset of 700 manually segmented underwater images in the wild and their respective ground truths. To the best of our knowledge, this is the first work to present a CNN approach to underwater image segmentation in the wild.
Future work includes the evaluation of other network architectures and generative adversarial networks [41], which could help in removing small artifacts that are not correctly penalized by simple loss functions. We also plan to increase the number of images of our real dataset.