Human Detection From Images And Videos: A Survey Paper
Summary of paper
The aim of human detection is to locate people in an image or video, typically by determining the smallest rectangular bounding boxes that enclose them. It is used in automatic tagging, video surveillance, autonomous vehicles, etc.
Automated human detection remains imperfect because of the articulated, non-rigid nature of the human body. The environment also plays a crucial role, as lighting conditions and cluttered backgrounds affect the quality of training images.
Existing algorithms decompose human detection into features and classifiers. Human detection generally follows this pipeline: input image -> candidate extraction -> human candidates -> human description -> classification -> post-processing -> result.
Candidate regions are extracted by scanning the image with windows, a process called window-based detection. The window size plays a crucial role in detecting humans and can be altered at run time to increase the algorithm's efficiency, much as an epoch count can be tuned at training time to improve results.
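The window-scanning step described above can be sketched as follows. This is a minimal illustration, not code from the survey; the window size, stride, and frame dimensions are illustrative assumptions:

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, w, h) for every candidate window that fits in the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)

# Example: 64x128 pedestrian-style windows scanned over a 320x240 frame.
windows = list(sliding_windows(320, 240, 64, 128, stride=32))
```

In a full detector, each window would be passed to the descriptor and classifier stages; the stride trades detection density against run time, which is why it is a natural run-time tuning knob.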
The description of humans plays a very important role. Human descriptors describe human objects across various viewpoints and poses. Their features can capture the shape, appearance, or motion information of an object. To capture the shape of a human, edge-based features are employed: pixel-level edge features are computed at individual pixels and then compared against a template modelling the human shape. This approach is called template matching, but it has disadvantages, since it requires many templates for different body structures, which makes it complex to extend. To overcome the shortcomings of pixel-level edge features, region-level features are used. Appearance features capture colour and texture from image intensity; the Haar feature is an example. Motion features distinguish objects whose motion patterns differ, which is why they are also used in object description.
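As an illustration of an appearance feature, a two-rectangle Haar-like feature can be computed efficiently from an integral image. This NumPy sketch shows the general technique and is an assumption on my part, not code from the paper:

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y+1, :x+1], built by cumulative sums."""
    return img.cumsum(axis=0).cumsum(axis=1)

def box_sum(ii, x, y, w, h):
    """Sum of pixels in the w x h box with top-left corner (x, y), in O(1)."""
    total = ii[y + h - 1, x + w - 1]
    if x > 0:
        total -= ii[y + h - 1, x - 1]
    if y > 0:
        total -= ii[y - 1, x + w - 1]
    if x > 0 and y > 0:
        total += ii[y - 1, x - 1]
    return total

def haar_two_rect_vertical(ii, x, y, w, h):
    """Left-half sum minus right-half sum: responds to vertical edges."""
    half = w // 2
    return box_sum(ii, x, y, half, h) - box_sum(ii, x + half, y, half, h)

# Example: a step edge (ones on the left, zeros on the right) gives a
# strong response.
img = np.hstack([np.ones((4, 2)), np.zeros((4, 2))])
ii = integral_image(img)
response = haar_two_rect_vertical(ii, 0, 0, 4, 4)  # 8.0 for this step edge
```

The integral image is what makes such features cheap enough to evaluate inside every candidate window of a sliding-window detector.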
Human descriptors are formed by combining features sampled over a regular grid on the object. After the descriptors are extracted, classification is done in one of two ways:
Generative classification: it constructs a model, such as a shape model or a structure model, and then a template-matching method is used to produce results.
Discriminative classification: it uses an SVM (support vector machine) to classify humans. The SVM separates human from non-human descriptors by maximizing the margin between the two classes. A linear SVM is not sufficient here, so a quadratic classifier is used. The given image samples are clustered into classes of human and non-human descriptors, and an SVM classifier is then trained on them to produce results.
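The margin-maximizing idea behind the SVM stage can be sketched with a sub-gradient (Pegasos-style) linear SVM on toy 2-D descriptors. The data, hyper-parameters, and the linear (rather than quadratic) formulation here are illustrative assumptions, not the survey's actual setup:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style sub-gradient descent on the regularized hinge loss.
    X: (n, d) descriptors; y: labels in {-1, +1}. Returns weight vector w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)           # decaying step size
            if y[i] * X[i].dot(w) < 1:      # margin violated: hinge gradient
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                           # only the regularizer shrinks w
                w = (1 - eta * lam) * w
    return w

# Toy "human" (+1) vs "non-human" (-1) descriptors, linearly separable.
X = np.array([[2.0, 2.0], [3.0, 2.5], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
preds = np.sign(X.dot(w))
```

In practice a kernelized (e.g. quadratic) SVM would replace the inner products when the descriptor classes are not linearly separable, which is the situation the survey describes.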
A number of datasets are publicly available for evaluating human detection algorithms. Different datasets are used for different scenarios.
General purpose datasets: MIT, INRIA, PENNFUDAN, USC-A, USC-C, H3D.
Surveillance purpose datasets: USC-B, CAVIAR.
Pedestrian detection datasets: CALTECH, TUD, CVC, DC, ETH.
The human detection problem is a binary classification: a region is either human or non-human. A number of measures have been proposed to evaluate detection algorithms. Some datasets provide cropped human (positive) samples and non-human (negative) samples, so evaluation can be done easily. Detection performance in these cases is represented by the ROC (receiver operating characteristic) curve, which shows the trade-off between the true positive and false positive rates.
TPR (true positive rate) = true positives / positive samples
FPR (false positive rate) = false positives / negative samples
Precision = true positives / (true positives + false positives)
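These measures can be computed directly from prediction counts. This small sketch assumes binary labels and predictions (1 = human, 0 = non-human) and is illustrative, not code from the survey:

```python
def detection_metrics(labels, preds):
    """Return (TPR, FPR, precision) for binary labels/predictions in {0, 1}."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    pos = sum(labels)                 # number of positive (human) samples
    neg = len(labels) - pos           # number of negative samples
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tpr, fpr, precision

# Example: 4 positives (3 detected), 4 negatives (1 false alarm).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = detection_metrics(labels, preds)  # (0.75, 0.25, 0.75)
```

Sweeping a detector's decision threshold and plotting each resulting (FPR, TPR) pair is what produces the ROC curve described above.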
Current approaches to human detection mostly deal with detecting humans in an upright standing position. This assumption serves pedestrian detection well, but it does not hold in applications such as image retrieval, video indexing, and human action analysis in sports. A future direction for human detection is fine-grained human detection. Since human detection is a crucial component of many applications, it is desirable to derive further information from each detected human object, such as the face, segmentation mask, viewpoint, pose, 3D locations of body parts, and more. Such information is known as fine-grained information.