A Review Paper on Hand Gesture Recognition

In recent years, human-computer interaction has become a vital part of most state-of-the-art technologies. The traditional modes of interaction via keyboard, mouse, and joystick cannot meet the demands of this fast-growing field, and hence this paper studies and explains hand gesture recognition to enable further development of natural communication between humans and computers. Different methods and algorithms previously used in hand gesture recognition projects are analysed and compared with respect to their advantages and drawbacks, and ongoing research challenges are highlighted. Finally, a proposed methodology is discussed to show its increased efficiency in image processing, skin colour detection, and hand gesture recognition.

KEYWORDS: Hand Gesture Recognition, Hidden Markov Model (HMM), Convolutional Neural Networks (CNN), Feature Extraction, Skin Colour Detection, Classification, Human Computer Interaction (HCI)

Introduction

Traditionally, users had to be physically tethered by wires in order to connect or interface with a computer system. With wired technology, users cannot move freely around the room, since they are connected to the computer by a wire and limited by its length. Instrumented gloves, also called electronic gloves or data gloves, are an example of wired technology. These data gloves provide good results, but they are too expensive to deploy across a wide range of common applications. Recently, some advanced image-based techniques have been introduced that require processing of image features such as texture and colour. The purpose of this project is to improve natural interaction between humans and computers so that recognised hand gestures can be used to convey meaningful information. We humans communicate not just with our words, but also with our gestures. With recent developments in computer vision and human-computer interaction, we can create a system capable of identifying hand gestures and then performing suitable actions such as moving the cursor on the desktop, opening certain applications, allowing smooth viewing and reading of PDFs, and managing certain display settings. We can define different positions or specified sequences of hand movements as the gestures our computer should recognise. Gestures may be static, requiring less computational complexity, or dynamic, which are more complex but also more feasible for real-time systems. To exploit the use of gestures in HCI, it is important to provide the means by which they can be interpreted by computers. There are two main characteristics that should be considered when designing an HCI system: functionality and usability. System functionality refers to the set of functions or services that the system equips the user to perform, and system usability refers to the level and scope under which the system can operate and perform specific user purposes efficiently. We will discuss different technologies and methods, such as Convolutional Neural Networks, OpenCV, the Hidden Markov Model, and background subtraction, to solve this problem. A system that recognises hand gestures can be used in various applications such as sign language interpretation, computer game control, and human-robot interaction.

Related Work

The essential aim of building hand gesture recognition systems is to augment interaction between humans and computers. HCI is also called Man-Machine Interaction (MMI), which refers to the relation between the human and the computer, or more precisely the machine. Many methods have been proposed for acquiring the information required for a hand gesture recognition system. Some methods use data glove devices and colour markers to simplify feature extraction of the gesture to be recognised. Other methods use the appearance of skin colour on the hand to help extract the relevant features and classify the gesture into defined postures or poses. Skin colour detection is used in the image processing stage, and the extracted features are fed to neural networks to classify them correctly and thus perform the required action. The most commonly used algorithms for hand gesture recognition include the Hidden Markov Model (HMM), which is based on statistics, as well as algorithms based on genetic algorithms and artificial neural networks. Many techniques based on the Histogram of Oriented Gradients (HOG), which uses edge- and gradient-based descriptors, have been proposed in the past for hand gesture recognition. RGB-to-grayscale segmentation techniques have been used in different sign language interpretation systems. Certain authors have proposed an improved Scale Invariant Feature Transform (SIFT) to extract features. The authors of [5] introduced a method based on hand characteristic curves and the combination of colour, edge, and motion information. The elastic curve matching method introduced in one of these papers was less dependent on segmentation. Another paper created an algorithm that could work in complex backgrounds; it used the maximum difference features to classify the gestures after segmentation of the Most Discriminating Feature (MDF) space. Paper [5] used motion and colour features to detect and track a hand, and combined template matching with a nearest neighbour classifier to recognise the hand gesture.

Proposed Methodology

System Requirements: The main component is a low-cost computer vision system that can run on a common personal computer equipped with a USB camera. The system should work under different conditions of background complexity and illumination.

Description of the Proposed Algorithm: The aim of the proposed algorithm is to create an efficient hand gesture recognition system that works well for various static as well as dynamic gestures and activates different functionalities on the computer the program is running on. The proposed algorithm is an amalgamation of the most effective methods used by various researchers and fellow hand gesture recognition system developers. It consists of the following steps:

Step 1: Creating our Data Set: We will create a data repository of gesture inputs with the help of a webcam. We will capture frames from the input stream at a rate of approximately 500 images per minute and store and process them locally. This step eliminates the need for a hand glove, which is an expensive device and can also be a source of discomfort to the user. A minimal capture loop is sketched below.
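As an illustration of this capture step, the following is a minimal sketch using OpenCV in Python; the device index, output path, and target rate of roughly 500 frames per minute are illustrative assumptions, not fixed requirements.

```python
# Minimal capture-loop sketch (assumes OpenCV is installed and a webcam
# is available at device index 0; paths and rate are illustrative).
import os
import time
import cv2

os.makedirs("dataset/raw", exist_ok=True)     # hypothetical output folder
capture = cv2.VideoCapture(0)                 # open the default USB webcam
interval = 60.0 / 500                         # ~500 frames per minute

for frame_id in range(500):
    ok, frame = capture.read()                # grab one frame from the stream
    if not ok:
        break                                 # camera unavailable or stream ended
    cv2.imwrite(f"dataset/raw/frame_{frame_id:04d}.png", frame)
    time.sleep(interval)                      # pace capture to the target rate

capture.release()
```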

Step 2: Defining the Essential Hand Gestures: Hand gestures are expressive, meaningful actions of humans, involving physical movements of the fingers, palms, and arms. They allow us to convey information to the computer and interact with the environment. Here we will classify hand movements into two major classes:

  1. Hand gestures, further divided into static postures (one pose or configuration of fingers and hand) and dynamic gestures (a sequence of actions performed).
  2. Unintentional movements.

From all the images captured in our repository, we will now sort them into different folders, one for each hand gesture to be recognised, as sketched below.
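A small sketch of this sorting step follows; the gesture labels are hypothetical examples, and in practice the label for each frame would come from a manual review of the captured images.

```python
# Sketch of sorting captured frames into one folder per gesture class.
# The label names here are hypothetical placeholders.
from pathlib import Path

GESTURES = ["open_palm", "fist", "thumbs_up", "swipe_left"]

def file_into_class(frame_path: str, label: str) -> None:
    """Move a captured frame into the folder for its gesture label."""
    target_dir = Path("dataset") / label
    target_dir.mkdir(parents=True, exist_ok=True)
    source = Path(frame_path)
    source.rename(target_dir / source.name)   # move frame into its class folder

# Example: file_into_class("dataset/raw/frame_0000.png", "fist")
```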

Step 3: Image Processing: The analysis of this input proceeds in two sequential tasks. The first is the extraction of features from the raw image, and the second is computing the model parameters and classifying the image. Skin colour detection plays an important role in the image processing. Our steps for image processing are as follows: Localization - separating the person performing the hand gesture from the rest of the image background. In order to allow smooth segmentation, it is necessary to determine the absolute position and orientation of the user.

Segmentation - Efficient hand tracking and segmentation are key to the success of any gesture recognition system; because of the challenges of vision-based methods, such as varying lighting conditions, complex backgrounds, and skin colour detection, a robust algorithm is required for a natural interface. Segmentation is the initial stage of the recognition process, in which the acquired image is broken down into meaningful regions or segments. The segmentation process is only concerned with partitioning the image, not with what the regions represent. Segmentation subdivides an image into its constituent parts, to a level of detail that depends on the problem being solved. Colour is a very powerful descriptor for object detection.

In our case, we convert an RGB or grayscale image into a binary (black-and-white) image, in which black represents the background and white represents the hand; a colour-based sketch follows the list below. The two main approaches to segmentation are:

  1. Pixel-based or local methods, which include edge detection and boundary detection.
  2. Region-based or global approaches, which include region merging and splitting, and thresholding.
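Since colour is the descriptor we rely on, one way to obtain the binary hand mask is a simple skin-colour threshold in HSV space. The sketch below is illustrative only: the HSV bounds are rough assumed values that typically need tuning for each camera and lighting condition.

```python
# Skin-colour segmentation sketch: white (255) for skin pixels, black (0)
# for background. The HSV bounds are assumed values, not calibrated ones.
import cv2
import numpy as np

def segment_skin(bgr_image: np.ndarray) -> np.ndarray:
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)      # assumed lower skin bound
    upper = np.array([25, 255, 255], dtype=np.uint8)   # assumed upper skin bound
    return cv2.inRange(hsv, lower, upper)              # binary mask of skin pixels
```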

We use Otsu's algorithm to convert the image into binary form. It is an unsupervised, non-parametric segmentation method that selects the threshold automatically and performs the segmentation. Otsu's thresholding method iterates through all possible threshold values and calculates a measure of spread (for example, the variance) of the pixel levels on each side of the threshold, i.e. of the pixels that fall in the foreground and in the background. The aim is to find the threshold value at which the sum of the foreground and background spreads is at its minimum. An application of this method using OpenCV's built-in implementation is shown below.
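A minimal sketch, assuming a grayscale input image at an illustrative path:

```python
# Otsu's thresholding via OpenCV. Passing cv2.THRESH_OTSU makes OpenCV
# ignore the supplied threshold (0) and pick the value that minimises the
# combined spread of foreground and background, as described above.
import cv2

gray = cv2.imread("dataset/raw/frame_0000.png", cv2.IMREAD_GRAYSCALE)  # illustrative path
threshold, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```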

Morphological filtering - If we take a close look at the segmented image after applying Otsu's algorithm to the original grayscale or RGB image, we find that the segmentation is not perfect: there are still some background parts containing 1s and some hand parts containing 0s. These errors can hamper contour detection of the hand gesture and reduce the system's efficiency, so we need to remove them. Morphological filtering is therefore applied to the segmented image to obtain a smoother, closed contour of the gesture. There are several useful operators defined in mathematical morphology: dilation, erosion, opening, and closing. Dilation is a transformation that produces an image of the same shape as the original but of a different size; it raises the valleys and enlarges the width of maximum regions, so it can remove negative impulsive noise while doing little to positive noise. Erosion is used to reduce objects in the image; it is known that erosion reduces the peaks and enlarges the widths of minimum regions, so it can remove positive noise but has little effect on negative impulsive noise. Opening (eroding first and then dilating) can remove small bright spots (i.e. "salt") and connect small dark cracks. Closing (dilating first and then eroding) can remove small dark spots (i.e. "pepper") and connect small bright cracks. A sketch using these operators follows.
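A minimal sketch of this clean-up, applied to the binary mask from the previous step (the 5x5 structuring element is an illustrative choice):

```python
# Morphological clean-up of a segmented hand mask: opening removes small
# bright specks ("salt"), closing then fills small dark holes ("pepper").
import cv2
import numpy as np

def clean_mask(binary: np.ndarray) -> np.ndarray:
    kernel = np.ones((5, 5), np.uint8)                         # structuring element
    opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)  # erode, then dilate
    return cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)   # dilate, then erode
```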

Step 4: Feature Extraction: Feature extraction is part of the data reduction process and is followed by feature analysis. One of the important aspects of feature analysis is determining exactly which features are important. Feature extraction is a complex problem in which the whole image or a transformed image is often taken as the input. The goal of feature extraction is to find the most discriminating information in the recorded images. Feature extraction operates on two-dimensional image arrays but produces a list of descriptions, or a feature vector. Mathematically, a feature is an n-dimensional vector with its components computed by some image analysis. The most commonly used visual cues are colour, texture, shape, spatial information, and motion in video. For example, colour information in an image may be represented by a colour histogram, colour binary sets, or colour coherence vectors. The selection of good features is crucial to gesture recognition because hand gestures are rich in shape variation, motion, and texture. Although hand postures can be recognised by extracting geometric features such as fingertips, finger directions, and hand contours, these features are not always available and reliable because of self-occlusion and lighting conditions. Moreover, although a number of non-geometric features are available, such as colour, silhouette, and texture, these features are inadequate on their own for recognition. Explicitly specifying features is not easy; therefore, whole images or transformed images are often taken as input, and features are selected implicitly and automatically by the classifier. One simple, explicit alternative is sketched below.
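As one concrete, deliberately simple illustration of an explicit feature vector (not the specific features this paper mandates), the seven Hu moments of the largest contour in the binary hand mask give a compact, scale- and rotation-invariant description; the sketch assumes OpenCV 4.x.

```python
# Hu-moment feature vector sketch: reduces a binary hand mask to a
# 7-dimensional, scale/rotation-invariant descriptor.
import cv2
import numpy as np

def hand_features(mask: np.ndarray) -> np.ndarray:
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        raise ValueError("no hand region found in mask")
    hand = max(contours, key=cv2.contourArea)        # assume largest blob is the hand
    hu = cv2.HuMoments(cv2.moments(hand)).flatten()  # seven invariant moments
    # Log-scale so the components have comparable magnitudes.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)
```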

Step 5: Gesture Recognition: Two methods are especially popular for this step, the Hidden Markov Model and the Convolutional Neural Network; for hand gesture recognition, we will use the former. Hidden Markov Model: If the conditional probability density of a particular event in a system depends only on a fixed number N of past events, the system is said to follow a Markov model; if N is 1, we call it a first-order Markov model. There are a number of variants of the Markov model, such as the Hidden Markov Model (HMM), which is the one we use in the hand gesture recognition system. A typical Hidden Markov Model involves: a set of states, denoted by S; initial and final states, denoted by S_I and S_F; a matrix representing the probability of transition from one state to another; and an output probability matrix capturing the probability of emitting a particular output symbol. Two processes operate in an HMM: the first is a Markov chain with a limited number of states, and the second is a random function associated with each state, which generates an observation symbol corresponding to the current state.

Each transition between states is therefore associated with a pair of probabilities: a transition probability, which gives the likelihood of undergoing that transition, and an output probability, which describes the conditional likelihood of emitting a particular output symbol from a finite alphabet, given the state. For a more detailed reference on the theory, computation, and application of HMMs in hand gesture recognition, the reader is referred to [].
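To make the model concrete, the sketch below scores an observation sequence against a first-order HMM with the forward algorithm; the matrices are tiny illustrative placeholders rather than learned gesture models. In a recognition system, one HMM would be trained per gesture class, and the class whose model assigns the highest likelihood to the observed symbol sequence would be chosen.

```python
# Forward-algorithm sketch: P(observations | model) for a discrete HMM.
# pi: initial state distribution, A: state transition matrix,
# B: output (emission) probability matrix. All values are placeholders.
import numpy as np

def forward_likelihood(pi, A, B, observations):
    alpha = pi * B[:, observations[0]]        # initialise with the first symbol
    for symbol in observations[1:]:
        alpha = (alpha @ A) * B[:, symbol]    # propagate states, then emit
    return alpha.sum()                        # total probability over end states

pi = np.array([0.6, 0.4])                     # two hidden states (assumed)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                    # transition probabilities
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])                    # two observation symbols
print(forward_likelihood(pi, A, B, [0, 1, 0]))
```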

Convolutional Neural Networks: Convolutional Neural Networks are very similar to ordinary neural networks: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function, from the raw image pixels at one end to class scores at the other, and it still has a loss function on the last (fully connected) layer. ConvNet architectures make the explicit assumption that the inputs are images, which allows certain properties to be encoded into the architecture. These properties make the forward function more efficient to implement and vastly reduce the number of parameters in the network. A filter is just a matrix of values, called weights, that are trained to detect specific features. The filter moves over each part of the image to check whether the feature it is meant to detect is present. To produce a value representing how confident it is that a specific feature is present, the filter carries out a convolution operation, which is an element-wise product and sum between two matrices. When the feature is present in part of the image, the convolution between the filter and that part of the image yields a real number with a high value; if the feature is absent, the resulting value is low. The result of passing this filter over the entire image is an output matrix that stores the convolutions of the filter over the various parts of the image. The filter must have the same number of channels as the input image so that the element-wise multiplication can take place.

For instance, if the input image contains three channels (RGB, for example), then the filter must contain three channels as well. The output of the convolution between the filter and the input image is summed with a bias term and passed through a non-linear activation function. The purpose of the activation function is to introduce non-linearity into the network; since the input data are non-linear (it is infeasible to model the pixels that form a handwritten signature linearly), the model needs to account for this. To do so, we use the Rectified Linear Unit (ReLU) activation function: values that are less than or equal to zero become zero, and all positive values remain the same.
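The following is a minimal sketch of this convolution-and-activation step on a single channel, with an explicit loop so the element-wise product and sum is visible; real ConvNets vectorise this computation and learn the kernel weights during training.

```python
# One filter, one channel: slide the kernel over the image, take the
# element-wise product and sum, add a bias, then apply ReLU.
import numpy as np

def conv2d_relu(image: np.ndarray, kernel: np.ndarray, bias: float = 0.0) -> np.ndarray:
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel) + bias   # element-wise product and sum
    return np.maximum(out, 0.0)                         # ReLU: negatives become zero
```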

Conclusion and Future Work

Hand gesture recognition is a method for seamless interaction between humans and computers, and it is a significant part of the future of technology. We must ask ourselves what we can do to improve and maximise the accuracy and robustness of such a system. More emphasis should be placed on building a dynamic gesture recognition system that gives satisfactory performance. Hand gestures differ from person to person and from culture to culture; the same hand gesture can mean different things to different users, which is why more research into classifying gestures accurately is needed. Without a doubt, there is always room for improvement in algorithms designed to learn by themselves. One of the major challenges in this system is the training time required by the different models for recognising hand gestures. Often, the memory requirements also increase manifold as the training data grows, which delays results at run time. There is also scope for better hardware for data collection. Upon analysis, the HMM technique was found to be more accurate than the neural network and instance-based learning models. Further research must be pursued in these fields to make the system more efficient.
