Artificial intelligence enabled automated diagnosis and grading of ulcerative colitis endoscopy images


Dataset

Kvasir is a multi-class dataset from Bærum Hospital in Vestre Viken Health Trust (Norway), collected from 2010 to 2014 [24]. Kvasir (v2) contains 8000 endoscopic images labelled with eight distinct classes, with approximately 1000 images per class, including ulcerative colitis. The images carry only image-level labels, provided by at least one experienced endoscopist together with medical trainees (a minimum of three reviewers per label). The images are independent, with only one image per patient.

Standard endoscopy equipment was used. HyperKvasir is an extension of the Kvasir dataset, collected from the same Bærum Hospital from 2008 to 2016, containing 110,079 images, 10,662 of which are labelled with 23 classes of findings [25]. Pathological findings account for 12 of the 23 classes; these are aggregated and summarized in Table 1. They can be broadly grouped into Barrett's esophagus and esophagitis in the upper GI tract, and polyps, ulcerative colitis, and hemorrhoids in the lower GI tract.

Importantly, the dataset includes 851 ulcerative colitis images which are labelled and graded using the Mayo endoscopic subscore [26,27] by a minimum of one board-certified gastroenterologist and one or more junior doctors or PhD students (a total of three reviewers per image). The images are in JPEG format with varying resolutions, the most common being 576×768, 576×720, and 1072×1920. Table 2 shows the number of images available for each Mayo grade.

The HyperKvasir study, including the HyperKvasir dataset available through the Center for Open Science used here, was approved by the Norwegian Privacy Data Protection Authority and exempted from patient consent because the data were fully anonymous. All metadata were removed and all files renamed with randomly generated file names before the internal IT department at Bærum Hospital exported the files from a central server. The study was exempted from approval by the Regional Committee for Medical and Health Research Ethics, South-East Norway, since the collection of the data did not interfere with the care given to the patient. Because the data are anonymous, the dataset is publicly shareable and complies with Norwegian law and the General Data Protection Regulation (GDPR). Apart from this, the data have not been pre-processed or augmented in any way.

Two binary classification tasks were formulated from the dataset:

Diagnosis: All pathological findings of ulcerative colitis were grouped into one class, and all other classes of pathological findings in the dataset into the other (Fig. 1a). The problem was formulated as a binary classification task to distinguish UC from non-UC pathology on endoscopic still images.

Figure 1. (a) Overview of the methods used to train the diagnostic classification model for ulcerative colitis from multi-class endoscopic images in the Kvasir dataset. (b) Overview of the methods used to train the model for endoscopic grading of ulcerative colitis on the HyperKvasir dataset.

Grading: Evaluation of disease severity using endoscopic images of UC pathology. Mayo-graded image labels were binned into Grades 0–1 and 2–3 (Fig. 1b). This grouping has been used in previous machine learning studies and for clinical trial endpoints [19]. The task was therefore to distinguish inactive/mild from moderate/severe UC.
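A minimal sketch of this label binning, assuming the Mayo grades are available as integers (the variable name `mayo_grades` is illustrative):

```python
# Bin Mayo endoscopic subscores into the two classes used for the grading task
GRADE_TO_CLASS = {0: 0, 1: 0,   # Grades 0-1: inactive / mild UC
                  2: 1, 3: 1}   # Grades 2-3: moderate / severe UC

binary_labels = [GRADE_TO_CLASS[grade] for grade in mayo_grades]
```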

A filter was designed to remove the green picture-in-picture depicting the endoscope. The filter applied a uniform crop to all images, filling in the missing pixels with 0 values, turning them black.
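A minimal sketch of such a filter, assuming the overlay occupies a fixed corner region of the frame; the box coordinates here are illustrative, not the values used in the study:

```python
import numpy as np

def remove_picture_in_picture(image, box=(slice(440, 576), slice(0, 180))):
    """Zero out a fixed region of the frame so the green overlay turns black."""
    masked = image.copy()
    masked[box] = 0  # fill the cropped region with 0-valued (black) pixels
    return masked
```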

Source images were then normalized to [−1, 1] and downscaled to 299×299 resolution using bilinear resampling. Images underwent random transformations of rotation, zoom, shear, and vertical and horizontal flips, using a set seed. Image augmentation was applied only to training-set images (not the validation or test sets), inside each fold of the fivefold cross-validation.
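A sketch of this preprocessing and augmentation pipeline with Keras' ImageDataGenerator; the directory path, seed, and transformation magnitudes are assumptions, since only the transformation types are stated:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

normalize = lambda x: x / 127.5 - 1.0  # rescale pixel values from [0, 255] to [-1, 1]

# Random rotation, zoom, shear, and flips for the training split only
train_gen = ImageDataGenerator(
    preprocessing_function=normalize,
    rotation_range=90, zoom_range=0.2, shear_range=0.2,
    horizontal_flip=True, vertical_flip=True,
)
val_gen = ImageDataGenerator(preprocessing_function=normalize)  # no augmentation

train_flow = train_gen.flow_from_directory(
    "data/train", target_size=(299, 299), interpolation="bilinear",
    class_mode="binary", batch_size=32, seed=42,
)
```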

There is a growing variety of machine learning frameworks that could provide the foundation for our study. Our choices here acknowledge the current dominance of deep neural network methods, despite the emerging challenges of explainability (explainable artificial intelligence, XAI) and trust in practical clinical implementation [41]. Most of our choices use the most popular method for classifying images, convolutional neural networks (CNNs), whose major differences lie in their depth of layering (50–160 layers) and in the dimensionality of the feature representations they extract from image regions (up to 2048).

The following four different CNN architectures were tested on the Kvasir dataset:

Pre-trained InceptionV3, a 159-layer CNN. The output of InceptionV3 in this configuration is a 2048-dimensional feature vector [28].

Pre-trained ResNet50, a Keras implementation of a 50-layer CNN that uses residual functions referencing previous-layer inputs [29].

Pre-trained VGG19, a Keras implementation of VGG, a 19-layer CNN developed by the Visual Geometry Group [30].

Pre-trained DenseNet121, a Keras implementation of DenseNet with 121 layers [31].

All pre-trained models were TensorFlow implementations initialized using ImageNet weights [32]. Training was performed end-to-end with no freezing of layers. All models performed the final classification step via a dense layer with a single node. Sigmoid activation was used at this final dense layer, with binary cross-entropy as the model's loss function.
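As one example, a minimal sketch of the InceptionV3 variant configured this way (the other three backbones follow the same pattern):

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import InceptionV3

# ImageNet-initialized backbone; all layers stay trainable (no freezing)
backbone = InceptionV3(include_top=False, weights="imagenet",
                       input_shape=(299, 299, 3), pooling="avg")  # 2048-d feature vector

# Final classification step: a single-node dense layer with sigmoid activation
output = layers.Dense(1, activation="sigmoid")(backbone.output)
model = Model(inputs=backbone.input, outputs=output)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```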

For both classification tasks, the final dataset was randomly shuffled and split in a 4:1 ratio: 80% of the images were used for fivefold cross-validation (training and validation), and the remaining 20% were held out as unseen images for evaluating model performance. The best models from each fold were combined and used as the final model for prediction on the test set.
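A sketch of this split and ensembling, assuming in-memory arrays `images` and `labels` and a `build_model()` helper; the paper does not state how the fold models are combined, so averaging their predicted probabilities is assumed here:

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold

# 4:1 shuffled split: 80% for fivefold cross-validation, 20% held out as the test set
X_dev, X_test, y_dev, y_test = train_test_split(
    images, labels, test_size=0.2, shuffle=True, random_state=42)

fold_models = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_dev):
    model = build_model()  # e.g. the InceptionV3 classifier sketched above
    model.fit(X_dev[train_idx], y_dev[train_idx],
              validation_data=(X_dev[val_idx], y_dev[val_idx]),
              epochs=20, batch_size=32)
    fold_models.append(model)

# Combine the fold models by averaging predicted probabilities on the unseen test set
test_probs = np.mean([m.predict(X_test) for m in fold_models], axis=0).ravel()
```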

Hyperparameters were fine-tuned using grid search over the following space: optimizer: Adam or stochastic gradient descent (SGD); learning rate: 0.01, 0.001, 0.0001; momentum (SGD only): 0, 0.5, 0.9, 0.99. For all models, training consisted of 20 epochs with a batch size of 32.
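A sketch of this grid search, iterating over the stated optimizer, learning-rate, and momentum values; `build_model()`, the training/validation arrays, and the validation-accuracy selection criterion are assumptions:

```python
import itertools
from tensorflow.keras.optimizers import Adam, SGD

learning_rates = [0.01, 0.001, 0.0001]
momenta = [0, 0.5, 0.9, 0.99]  # momentum applies to SGD only

# Enumerate every configuration in the search space
configs = [Adam(learning_rate=lr) for lr in learning_rates]
configs += [SGD(learning_rate=lr, momentum=m)
            for lr, m in itertools.product(learning_rates, momenta)]

best_score, best_config = -1.0, None
for optimizer in configs:
    model = build_model()
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                        epochs=20, batch_size=32)
    score = max(history.history["val_accuracy"])
    if score > best_score:
        best_score, best_config = score, optimizer.get_config()
```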

Models were evaluated using accuracy, recall, precision, and F1-scores. As this is a binary classification problem, confusion matrices and ROC curves were used to visualize model performance.
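A sketch of these evaluation metrics using scikit-learn, assuming the held-out labels `y_test` and the ensemble probabilities `test_probs` from above:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_curve, roc_auc_score)

y_pred = (test_probs >= 0.5).astype(int)  # threshold the predicted probabilities

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

fpr, tpr, _ = roc_curve(y_test, test_probs)  # points for plotting the ROC curve
print("ROC AUC  :", roc_auc_score(y_test, test_probs))
```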

To provide a visual explanation of what the models are learning, we chose the Gradient-weighted Class Activation Mapping (Grad-CAM) technique [33]. Grad-CAM produces a heatmap for each model output, showing which part(s) of the image the model relies on to make its prediction (i.e., which regions produce the strongest activation). The heatmap is a coarse localization map produced by using the gradient information flowing into the last convolutional layer to assign importance values to each neuron.
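A minimal Grad-CAM sketch for the sigmoid-output classifiers above; the layer name `mixed10` (the last convolutional block of InceptionV3) is an assumption and differs for the other architectures:

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name="mixed10"):
    """Return a coarse heatmap of the regions driving the model's prediction."""
    # Map the input to both the last conv feature map and the final prediction
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_out, pred = grad_model(image[np.newaxis, ...])
        score = pred[:, 0]  # single sigmoid output node

    grads = tape.gradient(score, conv_out)            # gradients into the last conv layer
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))   # per-channel importance values
    heatmap = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    heatmap = tf.maximum(heatmap, 0) / (tf.reduce_max(heatmap) + 1e-8)  # ReLU + normalize
    return heatmap.numpy()
```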

We also had an experienced gastroenterologist (D.C.B.) annotate and highlight the regions of interest in representative images to provide a comparison with the regions of interest generated by the heatmaps.

Model building and figure creation were performed using the TensorFlow and Keras packages [32] in Python 3.6.9, run in a Google Colab (https://research.google.com/colaboratory/) notebook.
