Seeds and Solutions in Science, Technology, and Engineering
Department of Computer Science, Nagoya Institute of Technology, Aichi, 466-8555, Japan
As described in the previous subsection, medical imaging has played an increasingly prominent role in diagnosis and therapy since the 1960s. Exponentially increasing computer power and digital imaging technology have allowed the design of computer-aided systems, which efficiently extract information useful for diagnosis and surgery from given medical images.
Many methods for medical image analysis have been proposed since the 1960s, but research on CAD systems started around 1980 . Before 1980, one of the goals of medical image analysis was to develop automated diagnosis systems, which were different from CAD. Automated diagnosis systems aimed to simulate the decision-making processes of physicians, while CAD systems provide a kind of “second opinion.” Pattern recognition techniques can be used to simulate decision-making: given input data, a pattern recognition method outputs symbols that denote the decision. Automated diagnosis systems are not clinically useful unless their performance is as accurate as that of physicians.
CAD systems, on the other hand, can help physicians read medical images by extracting only some of the image features needed for decision-making and by displaying the extracted features [73-75]. It is important for CAD systems to
Fig. 1.9 Timeline of some of the most active research topics in computer vision, which appeared in Ref. . Topics with asterisks are mentioned in this subsection
decrease the amount of data to be read by physicians without affecting diagnostic accuracy. For example, a method for computing temporal subtraction images from pairs of successive whole-body bone scans, developed by Shiraishi et al., enhances the interval changes between a previous image and the corresponding current one and can help physicians distinguish new cold and hot lesions . By enhancing important image features and displaying them appropriately, CAD systems greatly decrease the number of pixels the physician must analyze carefully.
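The core arithmetic of temporal subtraction is simple: subtract a registered previous scan from the current one so that interval changes stand out as nonzero residuals. The sketch below shows only this final step and assumes the two images are already registered; Shiraishi et al.'s method also includes registration and enhancement stages not reproduced here, and all arrays are toy data.

```python
import numpy as np

def temporal_subtraction(current, previous):
    """Subtract a pre-registered previous scan from the current one.

    Positive residuals mark newly increased uptake (new hot lesions);
    negative residuals mark decreased uptake (new cold lesions).
    Nonrigid registration, which the real method needs, is omitted.
    """
    return current.astype(np.float64) - previous.astype(np.float64)

# Toy example: a new "hot spot" appears at pixel (1, 1).
prev = np.zeros((3, 3))
curr = np.zeros((3, 3))
curr[1, 1] = 5.0
diff = temporal_subtraction(curr, prev)
```

In the subtraction image, unchanged anatomy cancels out, so the physician need only inspect the few pixels with large residuals.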
The techniques for extracting useful information from given images have been developed in computer vision since about 1960. Figure 1.9, which appeared in , shows the changes in popular research topics in the field of computer vision.
Research in medical image analysis started around 1960. The main objective of the early research was to extract primitives, such as edges and skeletons, that are useful for recovering the physical properties of targets, e.g., the surfaces of organs, from given images. In the 1960s and 1970s, studies aiming to realize general-purpose vision systems were more popular than they are today, and many of these investigations employed bottom-up approaches. That is, the primitives extracted from given images were independent of the targets to be described and were common to any vision system using this approach; they roughly correspond to Marr’s 2-1/2D sketch [77, 78], a rich description of the object surfaces projected into given images. These processes in the early stages of visual recognition are called “early vision.”
In the 1970s, the blocks world [79, 80] and generalized cylinders [81, 82] were employed as models for representing the three-dimensional structure of objects. Using these models, the primitives extracted from given images were assembled into descriptions of the objects. It should be noted that, unlike many models used for medical image analysis today, these models were not applied directly to given images (i.e., matrices of pixel values) but to the general-purpose primitives (i.e., symbolic descriptions) extracted from the images (see Fig. 1.10). Approaches in which models of targets are applied directly to given images became more widespread in the late 1970s.
In medical image analysis, it is important to describe the organ structures, and it is thus necessary to determine the boundaries of the organs in given images. Detection of edges, at which the pixel values change rapidly around the organ boundaries, has been an important research topic. An edge was supposed to be one of the primitives for general-purpose vision systems. Poggio et al. pointed out in 1985 that many problems in early vision, including edge detection, are ill-posed. Some prior knowledge of the targets is needed to constrain the admissible solutions of the problems and to obtain unique solutions [83, 78]. In other words, models of the targets must be applied even when extracting the primitives, and hence it is difficult to realize general-purpose vision systems with the bottom-up approach.
Y. Masutani et al.
Fig. 1.10 Computational architecture for a general-purpose vision system, which appeared in Ref. 
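The ill-posedness that Poggio et al. describe can be seen in differentiation, the core operation of edge detection: an arbitrarily small perturbation of the input can change its derivative by a large amount. A minimal 1-D illustration with synthetic signals:

```python
import numpy as np

n = 200
x = np.linspace(0.0, 1.0, n)
f = np.where(x > 0.5, 1.0, 0.0)                # ideal step edge
noise = 0.01 * np.sin(2.0 * np.pi * 40.0 * x)  # tiny high-frequency perturbation

# Finite-difference derivative, the simplest edge detector.
df = np.diff(f) * (n - 1)
dg = np.diff(f + noise) * (n - 1)

perturbation = np.max(np.abs(noise))       # the input changes by at most 0.01
output_change = np.max(np.abs(dg - df))    # the derivative changes by ~2
```

The derivative of the perturbed signal differs from the original by more than two orders of magnitude times the perturbation itself, which is why prior knowledge (e.g., smoothness of the underlying intensity function) is needed to make the solution unique and stable.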
Pictorial structures [84, 85] were also used as models of targets and were applied more directly to given images to detect the targets. A pictorial structure model represents a target as a combination of components. In Fischler and Elschlager’s work , the distance between each pair of components is represented by linking them with a spring, and the model is registered to a given image by minimizing a cost function with a dynamic programming technique. The pictorial structure constrains the admissible locations of the components in a given image and helps determine their locations stably.
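A minimal sketch in the spirit of Fischler and Elschlager's spring model (not their original algorithm): parts on a 1-D chain, each with an appearance cost per position and a quadratic spring cost between neighbors, fitted exactly by dynamic programming. All costs and sizes are illustrative.

```python
import numpy as np

def fit_chain(unary, rest_length, stiffness):
    """Place a chain of parts by minimizing appearance + spring costs.

    unary[i, x] : cost of placing part i at position x.
    Minimizes  sum_i unary[i, x_i]
             + sum_i stiffness * (x_{i+1} - x_i - rest_length)**2
    exactly, by dynamic programming over part positions.
    """
    n_parts, n_pos = unary.shape
    cost = unary[0].copy()                  # best cost ending at part 0
    back = np.zeros((n_parts, n_pos), dtype=int)
    positions = np.arange(n_pos)
    for i in range(1, n_parts):
        new_cost = np.empty(n_pos)
        for x in range(n_pos):
            spring = stiffness * (x - positions - rest_length) ** 2
            total = cost + spring
            back[i, x] = int(np.argmin(total))
            new_cost[x] = total[back[i, x]] + unary[i, x]
        cost = new_cost
    xs = [int(np.argmin(cost))]             # backtrack the optimal placement
    for i in range(n_parts - 1, 0, -1):
        xs.append(int(back[i, xs[-1]]))
    return xs[::-1]

# Two parts whose appearances match positions 2 and 7; the spring
# prefers a spacing of 5, so the global optimum is [2, 7].
unary = np.full((2, 10), 1.0)
unary[0, 2] = 0.0
unary[1, 7] = 0.0
placement = fit_chain(unary, rest_length=5, stiffness=0.1)
```

The spring term plays exactly the constraining role described above: it rules out placements whose components are implausibly far apart, which stabilizes detection when the individual appearance costs are ambiguous.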
In the 1980s, a regularization framework was introduced for solving ill-posed early vision problems [78, 86]. In this framework, an early vision problem is solved by minimizing a cost function with a regularization term that constrains the admissible solutions. Here, from the perspective of medical image analysis, active contour models (ACMs)  and active shape models (ASMs)  have played an important role. The boundaries of targets in given images can be determined by registering these models to the images; in the registration, cost functions with shape regularization terms are minimized. The regularization terms in ACMs represent the smoothness and/or the curvature of the boundaries. The relationships between ACMs and some popular edge detectors, such as those proposed by Marr and Hildreth  and by Canny , are described in Kimmel and Bruckstein’s work . In ASMs, the shapes of the boundaries are regularized by using statistical shape models (SSMs) constructed from sets of training data of target shapes: the model shapes are constrained to lie in the subspace constructed from the training data. The ASM is one of the most fundamental methods for segmenting organ regions in medical images [91-93]. Figure 1.11 shows examples of such training data for an SSM of the liver . The approaches that use SSMs for region segmentation aim to construct not general-purpose but specific-purpose vision systems, e.g., CAD systems. Not only the generalization ability but also the specificity is employed for evaluating the performance of the SSMs : the models used for determining the boundaries are required to represent only those of the specific targets. Analogous to the regularization approaches, Bayesian approaches can constrain the admissible solutions and have become more widely employed for solving early vision problems .
For example, the SSMs used for region segmentation represent prior probability distributions of the shapes of the target boundaries, and the specificity evaluates the accuracy of the prior distributions constructed from the training data: an SSM with better specificity corresponds to a more accurate prior probability distribution, in which contours with shapes peculiar to the target appear with higher probabilities.
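The regularization framework can be illustrated with generic Tikhonov smoothing (a stand-in for, not a reproduction of, the cited methods): recover a signal u from noisy data f by minimizing ||u - f||^2 + lam * ||Du||^2, whose closed-form minimizer solves (I + lam * D^T D) u = f.

```python
import numpy as np

def regularized_smooth(f, lam):
    """Minimize ||u - f||^2 + lam * ||D u||^2 in closed form.

    D is the forward finite-difference operator; the regularization
    term constrains the admissible solutions to smooth signals.
    """
    n = len(f)
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i], D[i, i + 1] = -1.0, 1.0
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, f)

rng = np.random.default_rng(0)
step = np.concatenate([np.zeros(20), np.ones(20)])   # ideal boundary
noisy = step + 0.2 * rng.standard_normal(40)
smooth = regularized_smooth(noisy, lam=5.0)
# The largest gradient of the regularized signal still marks the boundary.
edge_pos = int(np.argmax(np.abs(np.diff(smooth))))
```

Without the regularization term the minimizer is the noisy data itself; with it, the gradient responds mainly at the true boundary. This is the same mechanism by which the regularization terms of ACMs and ASMs keep contours on plausible shapes.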
Fig. 1.11 Examples of training data for constructing an SSM of the liver (Ref. ). A set of corresponding points is indicated on each of the surfaces
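The subspace constraint that an ASM imposes through its SSM can be sketched with PCA on corresponded, pre-aligned training shapes. This is a minimal linear SSM; the shapes below are tiny synthetic point sets, not liver data.

```python
import numpy as np

def build_ssm(training_shapes, n_modes):
    """PCA statistical shape model from flattened, corresponded shapes."""
    mean = training_shapes.mean(axis=0)
    _, _, vt = np.linalg.svd(training_shapes - mean, full_matrices=False)
    return mean, vt[:n_modes]              # mean shape + modes of variation

def constrain(shape, mean, modes):
    """Project an arbitrary shape onto the model subspace, as an ASM does."""
    b = modes @ (shape - mean)             # shape parameters
    return mean + modes.T @ b, b

# Synthetic training set: 4-point "shapes" varying along one direction.
rng = np.random.default_rng(1)
direction = np.array([1.0, -1.0, 1.0, -1.0]) / 2.0
train = np.stack([c * direction for c in rng.standard_normal(20)])
mean, modes = build_ssm(train, n_modes=1)
constrained, b = constrain(np.array([1.0, 0.0, 0.0, 0.0]), mean, modes)
```

Whatever candidate contour a search produces, projecting it back onto the learned subspace guarantees a shape the training set could plausibly generate; this is the specificity property discussed above.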
Markov random fields (MRFs) were employed for such representations in computer vision in the 1980s [96-98]. For example, Geman and Geman proposed an image segmentation method that used an MRF model to represent the distribution of pixel labels and that segmented given images into regions by inferring the label of each pixel on the MRF model . A Gibbs sampling technique was used in  for the inference, but graph-cut techniques  were later widely employed for inference on MRF models; if an MRF model satisfies certain conditions, the optimal solution can be obtained by using graph cuts or other MRF inference algorithms such as belief propagation [100, 101]. Many energy-based segmentation methods in the 1990s [102, 103] segmented images by minimizing energy functions that can be derived from MRF models. The combination of an MRF model and a graph-cut technique remains one of the most fundamental tools for image region segmentation today: ASMs can be applied to determine the boundaries of organs, followed by graph-cut techniques to improve the precision of the segmentation. Specifying the targets can greatly improve the performance of vision systems; constructing general-purpose vision systems, by contrast, appears to be extremely difficult.
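The energy that graph cuts minimize can be written down directly. The sketch below defines a standard data + Potts-smoothness energy for binary labels and finds its exact minimum by brute-force enumeration, which is feasible only for a toy image; real systems use graph cuts or belief propagation for the same minimization, and the image values here are illustrative.

```python
import itertools
import numpy as np

def mrf_energy(labels, image, lam):
    """Data term (squared error to label prototypes 0 and 1)
    plus a Potts smoothness term (lam per disagreeing 4-neighbor pair)."""
    h, w = image.shape
    energy = float(np.sum((image - labels) ** 2))
    for y in range(h):
        for x in range(w):
            if x + 1 < w and labels[y, x] != labels[y, x + 1]:
                energy += lam
            if y + 1 < h and labels[y, x] != labels[y + 1, x]:
                energy += lam
    return energy

def exact_segmentation(image, lam):
    """Globally optimal binary labeling by enumerating all 2^(h*w) labelings."""
    h, w = image.shape
    best_labels, best_energy = None, np.inf
    for bits in itertools.product((0, 1), repeat=h * w):
        labels = np.array(bits).reshape(h, w)
        e = mrf_energy(labels, image, lam)
        if e < best_energy:
            best_labels, best_energy = labels, e
    return best_labels, best_energy

image = np.array([[0.1, 0.2, 0.9],
                  [0.0, 0.8, 1.0],
                  [0.2, 0.9, 0.8]])
seg, energy = exact_segmentation(image, lam=0.5)
```

Because this binary energy is submodular, a graph cut would find the same labeling in polynomial time; the smoothness term is what keeps isolated noisy pixels from being labeled as separate regions.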
The goal of vision systems, including those for medical applications, is to generate compact descriptions of targets from given image data by using models: given image data x, the state of the world, w, is described by inferring w using models that represent the target world and the relationships between x and w. The image data, x, consist of many pixel values, while the compact description, w, consists of a smaller number of numerical values or symbols. The inference is called “recognition” when the descriptions, w, are discrete and “regression” when they are continuous .
One difficulty that is specific to computer vision comes from the fact that, in many cases, a large portion of each given image is not useful for the inference. An image consists of many pixel values, and not all of these values contribute to the inference. For example, to recognize characters in an image, one should first detect the locations of the characters, and only the local images around the detected locations should be processed to infer the character codes. Only these local images contribute to the inference; the remaining large portion of the given image, the background, does not. Appropriate subsets of pixels must therefore be selected from the given image data before the final inference is made. Such pixel selection is not easy, especially when the targets to be inferred and the backgrounds to be excluded both have a wide variety of appearances, because it is then difficult to find appearance features that distinguish the targets from the backgrounds.
The variety of images of a target entity also makes inference in computer vision difficult. The appearance of a single 3D object, a human face, for example, varies greatly depending on its relative pose and distance with respect to the camera and on the lighting conditions. Occlusions also cause large variations in appearance. The variety among multiple target entities that should be described by an identical description, w, also makes the problems difficult; for example, a human face detection system should be able to detect the face of any person. These large varieties require models with high degrees of freedom (DoF) to represent the appearances. The higher the DoF, the more complex the inference processes and the more difficult the accurate construction of the models.
To solve the abovementioned problems, a computer vision system, in general, processes given images successively; at each stage, information useful for the next process is extracted from the descriptions output by the previous process, and the newly extracted information is described and input to the next process. Roughly speaking, there exist two approaches for designing such systems: the bottom-up approach and the top-down approach. Today, top-down approaches are often employed to make vision systems practically applicable. In a top-down approach, a model that specifically represents the global aspects of each target is introduced, and the processes at earlier stages are also designed specifically for each target. Data that a bottom-up system would extract over multiple consecutive stages are often extracted by one process at a single stage; for example, local primitives are extracted simultaneously while a global shape model of a target object is registered to a given image. Most medical image processing systems are also designed using top-down approaches.
Medical image processing is different from other image processing in the following aspects:
1. Input images are, in many situations, three-dimensional.
2. Various kinds of targets are included in a given image; the targets are located close together, their shapes are extremely varied, and they interact with each other.
3. It is vital to determine the location and the boundary of each anatomical target accurately.
The first aspect, the difference in dimensionality, means that medical images contain far more voxels than conventional two-dimensional images contain pixels, and the second aspect makes identifying the boundary of each target (e.g., a target organ) difficult. The third aspect, the crucial importance of boundary determination, characterizes the research field of medical image processing. Many computer vision systems realized today do not need to determine the boundaries of targets with such a high degree of accuracy. Character recognition systems do not detect the contours of character strokes, face recognition systems do not detect exact face boundaries, and image searching systems do not detect the boundaries of targets in images. Some of these systems, such as character recognition and face detection systems, determine a bounding box for each target, and the appearance inside the bounding box is directly input to the final process for obtaining the final description of the target. Others, such as image searching systems, compute histograms of image features, e.g., a “bag of features,” and do not explicitly use any boundary information for the final inference. The need for highly accurate determination of object boundaries is one of the bottlenecks to the practical applicability of medical computer vision systems.
One can classify the top-down methods of image segmentation into two categories: in one, the values of the parameters of a global target model are inferred by regression; in the other, a label denoting a target ID number is inferred for each pixel by recognition. One of the main methods in the former category is the registration of models, including the ASMs, to given images. As described above, the ASMs are statistical models that represent the global shapes of targets and are generated from training data of the targets. The variety of the global shapes is represented by a set of shape parameters, w, and the values of w are inferred so that the resultant curves (or surfaces) lie along the boundaries of the targets. One of the main methods in the latter, recognition category is voxel label inference by the minimization of cost functions: a set of labels denoting the target IDs is defined, and a label is inferred for each voxel in the given images. The cost functions quantitatively evaluate the appropriateness of the voxel labels based on the voxel values in the given images and on the label combinations between neighboring voxels.
Both recognition and regression have been well studied in machine learning, which is one of the most popular research topics in computer vision today, as shown in Fig. 1.9. The progress of learning algorithms makes it possible to efficiently construct models with higher DoF, which can represent the statistical properties of targets more accurately, from larger sets of training data. For example, learning algorithms for deep neural networks have been proposed [105, 106], and the resultant networks have demonstrated state-of-the-art performance in pattern recognition and regression in many applications. As the number of medical images available for learning increases monotonically, the role of learning in medical image processing is becoming more important. It should be noted, though, that the statistical properties derived from training data are not the only knowledge that can be used for medical image processing; such knowledge can also be provided by many existing fields, e.g., anatomy, pathology, and physics [107, 108].
Computational anatomy is a system for automatically generating medical descriptions of patients from their medical images: the system consists of algorithms for generating the descriptions and the models used in those algorithms. In the computer vision research field, algorithms for generating descriptions of given images, and the models used in those algorithms, are studied; their progress improves the algorithms for medical image analysis and accelerates the development of computational anatomy. Among the large variety of computer vision research topics, image segmentation is one of the most fundamental, and its improvement is vital for CAD systems: it largely helps physicians by accurately and automatically describing each anatomical structure in a given medical image. As described above, image segmentation needs models of targets, and hence the segmentation of anatomical structures needs computational anatomical models. This is why algorithms for registering these models to given images are studied so intensively in medical image analysis research. Though the computational anatomical models are useful only for segmenting medical images, medical image segmentation is not isolated from other research fields. The progress of machine learning and statistical inference, for example, can directly improve the construction of statistical anatomical models and the performance of model registration, and the progress of computational anatomy can contribute to the development of computer vision, especially through new algorithms and models for accurately and stably describing target regions.