Computer Vision And Its Techniques
Computer Vision had its start around the 1970s, when it was viewed as "the visual perception component of an ambitious agenda to mimic human intelligence and to endow robots with intelligent behaviors", as described by R. Szeliski. At that time, what distinguished the field from Digital Image Processing was its goal of recovering the 3D structure of the world from images and using it towards full scene understanding.
Forty years later, Computer Vision is very different from what it was; it is part of our daily life, from phone cameras recognizing our faces to cars becoming intelligently autonomous. The main focus of Computer Vision is to interpret images and videos the way humans do so effortlessly, and to achieve that, researchers develop new mathematical techniques to understand how the 3-dimensional shape and appearance of an object is perceived. Vision is difficult mainly because it is an inverse problem: it tries to determine a specific solution while receiving insufficient information. These solutions address hard natural problems whose results have a significant impact on our society. On the statistical side of Computer Vision, probability distributions can be used to model the scene, but also to select the algorithm that best performs a certain action.
While diving into the different techniques for image processing, segmentation, detection, tracking and classification, the most reliable way to choose the right one for each topic in this project was to gather information about more than one option before committing to the first that appeared. The next part focuses on exactly that: the various topics in the computer vision process, from processing an image to understanding it in the end.
The first part of the process is called image processing, which focuses on editing the images that will be worked on later. There is no right or wrong processing, since there are many ways of processing an image and all images differ, so the best approach is to reason backwards: if the final goal is known, it is much easier to decide how the image needs to be processed. For example, if cell detection is a priority, then the edges of each cell should be well preserved. Image processing can serve multiple functions – it can correct exposure, balance colors, reduce image noise or increase sharpness – and it is crucial for computer vision to produce acceptably accurate results. In image processing there are point operators, in which each output pixel’s value depends only on the corresponding input pixel value. The different types of point operators are (a short sketch follows the list):
- Pixel transform – a function that takes one or more input images and produces an output image.
- Color transforms – manipulate color for better image visualization.
- Compositing and matting – cut an object out of its original background (matting) and then place it over another image (compositing).
- Histogram equalization – brightens some dark values and darkens some light values by finding an intensity mapping function.
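A minimal sketch of two point operators is shown below, assuming NumPy and OpenCV are available; the file name is hypothetical and the gain/bias values are arbitrary.

```python
import cv2
import numpy as np

# "cells.png" is a hypothetical file name; any 8-bit grayscale image works.
img = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)

# Pixel transform: a gain/bias adjustment g(x) = a*f(x) + b applied
# independently to every pixel (here a = 1.5, b = 10, chosen arbitrarily).
gain, bias = 1.5, 10
adjusted = np.clip(gain * img.astype(np.float32) + bias, 0, 255).astype(np.uint8)

# Histogram equalization: remaps intensities through the cumulative
# histogram, brightening dark values and darkening light ones.
equalized = cv2.equalizeHist(img)
```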
There are also linear filtering processes, which differ from point operators in that they use a collection of pixel values in the neighborhood of a given pixel to determine its output value. This process is responsible for tone adjustment, soft blur, sharpening details, accentuating edges (useful for cell detection) and noise removal. The different types of linear filtering are (see the sketch after the list):
- Padding – the image is extended beyond its borders (for example with zeros, by clamping or by mirroring) so that filters can be applied near the boundaries without boundary artifacts.
- Separable filtering – a 2D convolution is separated into a vertical and a horizontal 1D convolution, which is cheaper to compute.
- Band-pass and steerable filters – band-pass filters respond to a limited range of frequencies (smoothed derivative filters), and steerable filters combine a small set of such filters so that the result can be steered to an arbitrary orientation.
- Recursive filtering – values depend on previous filter outputs.
- Bilateral filtering – although slower, each output pixel’s value is a weighted combination of neighboring pixel values, with weights that depend on both spatial distance and intensity similarity, so noise is smoothed while edges are preserved.
- Convolution – the process of applying a filter by replacing each pixel with a weighted average of its neighbors; when used to reduce image noise, its main effect is smoothing or blurring.
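The sketch below illustrates linear filtering, assuming SciPy and OpenCV are available; the Gaussian smoothing uses a separable implementation, and the small sharpening kernel accentuates edges by convolution. The file name is hypothetical.

```python
import cv2
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

# "cells.png" is a hypothetical file name.
img = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Separable filtering: the 2D Gaussian blur is applied as a vertical
# and a horizontal 1D convolution, which is cheaper than a full 2D kernel.
blurred = gaussian_filter(img, sigma=2.0)

# Convolution with a small kernel that sharpens details and accentuates
# edges (useful when cell boundaries need to stand out).
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)
sharpened = convolve(img, sharpen_kernel, mode="reflect")
```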
The next process after image processing is segmentation, which is by definition the collection of pixels or pattern elements into summary representations that emphasize important image properties. Two different examples of segmentation are:
- Background subtraction – anything that does not look like the background is of interest (a sketch follows this list).
- Shot boundary detection – abrupt changes between video frames mark shot boundaries and are important to take into account.
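Below is a minimal background-subtraction sketch, assuming OpenCV's MOG2 subtractor; the video file name and parameters are hypothetical.

```python
import cv2

cap = cv2.VideoCapture("sample.mp4")  # hypothetical video file
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Pixels that do not fit the learned background model are marked
    # as foreground in the mask and are therefore of interest.
    foreground_mask = subtractor.apply(frame)

cap.release()
```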
Image segmentation can be done by clustering pixels. Clustering is a process in which a dataset is replaced by clusters, which are collections of data points that belong together. Each image pixel is represented by a feature vector that stores all the measurements describing that pixel (intensity, location, color). Every feature vector belongs to exactly one cluster, and each cluster represents an image segment. The segmented image is obtained by replacing the feature vector at each pixel with the value of that feature vector’s cluster center.
The main clustering algorithms are (a mean-shift sketch follows the list):
- Agglomerative clustering – each point starts as a separate cluster; until the clustering is satisfactory, the two clusters with the smallest inter-cluster distance are merged.
- Divisive clustering – a single cluster initially contains all points; until the clustering is satisfactory, the cluster whose split yields the two components with the largest inter-cluster distance is divided.
- Mean shift – initially there is one mode for each data point; points whose kernel estimates converge to the same mode form a tightly clustered group, and finally each pixel is replaced by the representation of its mode.
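A minimal mean-shift segmentation sketch follows, assuming scikit-learn's MeanShift and OpenCV; each pixel is described by a (color, location) feature vector and is then replaced by the color of its cluster center. The file name and the resize/bandwidth parameters are hypothetical choices to keep the example fast.

```python
import cv2
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

img = cv2.imread("cells.png")          # hypothetical file name
small = cv2.resize(img, (64, 64))      # small size keeps the example fast
h, w, _ = small.shape

# One feature vector per pixel: (B, G, R, x, y).
ys, xs = np.mgrid[0:h, 0:w]
features = np.column_stack([small.reshape(-1, 3),
                            xs.reshape(-1, 1),
                            ys.reshape(-1, 1)]).astype(np.float64)

# Cluster the feature vectors; each cluster is one image segment.
bandwidth = estimate_bandwidth(features, quantile=0.1, n_samples=500)
labels = MeanShift(bandwidth=bandwidth, bin_seeding=True).fit_predict(features)

# Replace each pixel's feature vector by its cluster center's color.
centers = np.array([features[labels == k, :3].mean(axis=0)
                    for k in np.unique(labels)])
segmented = centers[labels].reshape(h, w, 3).astype(np.uint8)
```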
Feature detection and matching is an essential component of many computer vision applications. It focuses on keypoint features (interest points) – specific locations in an image – and on edges – object boundaries. In feature detection and matching, point features are useful for finding sets of corresponding locations in different images.
There are two main approaches:
- The 1st (more suitable for nearby viewpoints) finds features in one image that can be accurately tracked using a local search technique (such as correlation or least squares).
- The 2nd (more suitable for large amounts of motion) independently detects features in all the images and then matches features based on their local appearance.
There are four important stages in the keypoint detection and matching pipeline: 1st feature detection, 2nd feature description, 3rd feature matching and 4th feature tracking.
In feature detection, the image is searched for locations that are likely to match well in other images, although it cannot be known in advance which other image locations will match. Textureless locations are usually nearly impossible to match, whereas high-contrast locations are reliably detected and consequently matched. Feature description uses the detected features to determine which ones come from corresponding locations in different images: it extracts a local scale, orientation and affine frame estimate, and uses them to resample the patch before forming the feature descriptor.
Two common examples are MOPS (Multi-Scale Oriented Patches), which compensates for slight inaccuracies, and SIFT (Scale Invariant Feature Transform), which reduces the influence of gradients far from the center.
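The sketch below shows keypoint detection and description with SIFT, assuming an OpenCV build that exposes cv2.SIFT_create; the file name is hypothetical.

```python
import cv2

# "frame1.png" is a hypothetical file name.
img = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
# Detect interest points and compute a 128-dimensional descriptor
# from the local patch around each of them.
keypoints, descriptors = sift.detectAndCompute(img, None)
```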
The third stage is feature matching: after feature descriptors are extracted, preliminary feature matches are established between images. It uses a matching strategy – which decides which matches are sent for further processing – and devises efficient data structures and algorithms to perform this matching as fast as possible. In feature matching, an ROC (Receiver Operating Characteristic) curve can be used to predict the accuracy of the matches, but exhaustive matching has a problem: it is impractical, since it compares all features against all other features in the potentially matching images. The solution is indexing structures that rapidly search for features near a given feature (such as multi-dimensional hashing).
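A minimal matching sketch, assuming SIFT descriptors and OpenCV's FLANN-based matcher as the indexing structure; a ratio test against the second-best neighbour is used here to discard ambiguous matches. The file names are hypothetical.

```python
import cv2

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical files
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Approximate nearest-neighbour search with k-d trees instead of
# comparing every feature against every other feature.
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
matches = flann.knnMatch(des1, des2, k=2)

# Keep a match only if it is clearly better than the second-best candidate.
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
```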
Finally, feature tracking finds a set of likely feature locations in a first image and then searches for their corresponding locations in subsequent images; it is usually used in video recognition. [3] Edges are also crucial for performing feature detection and matching, mainly because edges occur at the boundaries between regions of different color, intensity or even texture. The best way to characterize an edge is as a location of extreme intensity variation.
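The sketch below illustrates feature tracking between two frames with Lucas-Kanade optical flow, plus edge extraction with the Canny detector; it assumes OpenCV, and the frame file names and thresholds are hypothetical.

```python
import cv2

prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Find likely feature locations in the first frame...
points = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
# ...and search for their corresponding locations in the next frame.
new_points, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, points, None)
tracked = new_points[status.flatten() == 1]

# Edges: locations of extreme intensity variation.
edges = cv2.Canny(prev, threshold1=50, threshold2=150)
```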
The field of feature detection and matching is also closely related to object recognition, one of the most interesting topics in computer vision. Object recognition is the ability of a computer to determine whether or not it recognizes an object. It can be divided into instance recognition – recognizing 2D and 3D rigid objects – and class recognition – recognizing any instance of a particular general class. The process of object detection consists of a few different steps:
- extracting a set of interest points from each database image,
- storing the associated descriptors in an indexing structure; at recognition time, features are extracted from the new image and compared against the stored object features.
If most of them match, a second round of verification takes place to evaluate the consistency of the match. There is a problem, though: with large databases, feature matching takes time, so to overcome this obstacle inverted indexing is a good solution for performing quick lookups (a toy sketch follows).
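Below is a toy sketch of an inverted index, assuming each descriptor has already been quantized into a discrete word id (for example by clustering the descriptors); the function names and data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(database):
    """database maps image_id -> iterable of (quantized) feature word ids."""
    index = defaultdict(set)
    for image_id, words in database.items():
        for word in words:
            index[word].add(image_id)
    return index

def candidate_images(index, query_words):
    """Vote for database images that share words with the query."""
    votes = defaultdict(int)
    for word in query_words:
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # The most-voted images go on to the second verification round.
    return sorted(votes, key=votes.get, reverse=True)

# Hypothetical toy data: two database images and a query.
index = build_inverted_index({"img_a": [3, 17, 42], "img_b": [17, 99]})
print(candidate_images(index, [17, 42]))   # "img_a" shares more words
```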
The sliding window method is used for objects with well-behaved appearance that do not deform. The algorithm is based on a few actions: it builds a dataset of labeled image windows of fixed size (n*m), containing positive examples (large, centered instances of the object) and negative examples with no instance. It then trains a classifier to tell these windows apart and passes every window in the image to the classifier.
The classifier outputs positive if the window contains the object and negative if it does not. But there is a scale problem: different sizes of the same object can trick the method, so a Gaussian pyramid of the image is prepared and the windows are searched in each layer of the pyramid; searching a layer scaled by ‘s’ with an n*m window is equivalent to searching (s*n) * (s*m) windows in the original image. Another problem is overlap, where multiple windows over one object can all be classified as positive and reported as multiple objects. The solution for this is non-maximum suppression, in which a window with a local maximum of the classifier response suppresses nearby windows (see the sketch below).
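A minimal sketch of sliding-window detection with non-maximum suppression follows; the `classifier(window) -> score` function, window size and thresholds are hypothetical, and in practice the search is repeated over each layer of the Gaussian pyramid to handle scale.

```python
import numpy as np

def sliding_window_detect(image, classifier, win=(32, 32), stride=8, threshold=0.5):
    """Score every n*m window and keep the ones the classifier accepts."""
    detections = []
    H, W = image.shape[:2]
    n, m = win
    for y in range(0, H - n + 1, stride):
        for x in range(0, W - m + 1, stride):
            score = classifier(image[y:y + n, x:x + m])
            if score > threshold:
                detections.append((score, y, x))
    return non_max_suppression(detections, min_dist=16)

def non_max_suppression(detections, min_dist):
    """A window with a locally maximal response suppresses nearby windows."""
    kept = []
    for score, y, x in sorted(detections, reverse=True):
        if all(abs(y - ky) >= min_dist or abs(x - kx) >= min_dist
               for _, ky, kx in kept):
            kept.append((score, y, x))
    return kept
```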