Design and implementation in reconfigurable hardware of a hand gesture recognition system based on computer vision

Author:
  1. Francisco Javier Toledo Moreo
Supervised by:
  1. Isidro Villó Pérez (Supervisor)
  2. Rafael Toledo Moreo (Supervisor)

Defence university: Universidad Politécnica de Cartagena

Date of defence: 03 February 2023

Examination committee:
  1. Ignacio Bravo Muñoz (Chair)
  2. José Santa Lozano (Secretary)
  3. Mercedes Valdés Vela (Committee member)

Type: Thesis


In this thesis, a system for hand gesture recognition based on computer vision is proposed, together with the design of its hardware implementation. The purpose of gesture recognition is to provide a computer with the ability to detect gestures made by a person. This task, innate to humans, has proven complex and difficult to automate. Among the different approaches to the problem, one of the main lines of work is the use of computer vision. The development of computer vision techniques has provided tools for microprocessor-based systems to analyze images acquired by cameras and extract from them information of interest for a given application. Seen from this perspective, hand gesture recognition is an object recognition problem, a field in which two levels can be distinguished: the instance level, when looking for a specific object (for example, a particular person), and the category level, when trying to recognize any instance of a type of object. This second level aims, given a collection of object categories and an image, to determine whether any object of one of those categories is present in it. In this thesis, the category is a gesture, defined by a certain position and orientation of the hand and by the configuration of the fingers. Within this framework, a collection of categories to be recognized, a gesture library, has been defined and, to that end, a set of processing steps and algorithms that make up the hand gesture recognition system has been developed. The first step separates the hand from the rest of the image. For this purpose, a skin color recognition algorithm is proposed, based on models built in different color spaces. Although developed for this purpose, it may be of interest in any of the numerous applications where skin color-based image segmentation is carried out.
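The skin-color segmentation step can be sketched as a per-pixel classifier over a color-space model. The sketch below uses a simple box model in YCbCr space with Cb/Cr bounds that are illustrative values commonly cited in the skin-detection literature; they are assumptions for illustration, not the models built in the thesis, and the function name `skin_mask_ycbcr` is hypothetical.

```python
import numpy as np

def skin_mask_ycbcr(image_rgb: np.ndarray) -> np.ndarray:
    """Classify each pixel as skin/non-skin with a box model in YCbCr.

    The Cb/Cr bounds are illustrative literature values, not the
    thresholds derived in the thesis.
    """
    rgb = image_rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    # RGB -> Cb, Cr using the ITU-R BT.601 conversion (luma is not needed
    # for the box test, so it is not computed)
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    # A pixel is skin if its chrominance falls inside the box
    return (cb >= 77) & (cb <= 127) & (cr >= 133) & (cr <= 173)
```

A per-pixel test like this is exactly the kind of operation that later maps naturally onto hardware, since each pixel is classified independently of its neighbors.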
Once the image is segmented, the hand is detected and the gesture recognized by identifying its elementary parts, palm and fingers, by means of two-dimensional convolution of the segmented image with a set of templates defined for that purpose. From the analysis of the responses of these templates to the images of a gesture database, a model has been constructed for each of the gestures in the library. Throughout the development of the different stages, the design methodology has favored modularity and scalability, so that the gesture library can be updated and the overall operation of the system adapted to different applications. To give the user a satisfactory experience when operating the recognition system, the interaction must be as natural as possible. This requires that the user perceives the system as responding immediately to his or her actions, which makes response speed a key performance indicator. To optimize the execution time of the processing algorithms, solutions based on reconfigurable hardware were explored. FPGA devices are a suitable platform for accelerating computationally intensive algorithms. Their internal structure makes them ideal for exploiting the pixel-level parallelism inherent in low-level image processing, as well as instruction-level parallelism through pipelining and, at the same time, higher-level parallelism for the simultaneous execution of different operations. For all these reasons, FPGAs are an appropriate hardware platform for the implementation of our system. Using Xilinx® devices and tools, we have designed, implemented, and validated a digital system that executes the processing tasks involved in gesture recognition, within a hybrid hardware/software architecture.
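The template-based detection described above can be sketched as a two-dimensional correlation of the binary mask with each template, keeping the template and position with the strongest response. The following is a minimal sketch; the function names and the reduction to a single best match are assumptions for illustration, not the thesis's gesture models.

```python
import numpy as np

def correlate2d_valid(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Naive 'valid'-mode 2-D correlation of a (binary) image with a template."""
    ih, iw = image.shape
    th, tw = template.shape
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Response = overlap between the template and the image window
            out[y, x] = np.sum(image[y:y + th, x:x + tw] * template)
    return out

def best_match(mask: np.ndarray, templates: list) -> tuple:
    """Return (template index, (y, x) position, score) of the strongest response."""
    best = (None, None, -np.inf)
    for idx, t in enumerate(templates):
        resp = correlate2d_valid(mask, t)
        y, x = np.unravel_index(np.argmax(resp), resp.shape)
        if resp[y, x] > best[2]:
            best = (idx, (int(y), int(x)), resp[y, x])
    return best
```

In practice a recognition system would combine the responses of several palm and finger templates into a per-gesture model rather than take a single maximum, but the convolution core is the same, and it is this core that dominates the computational cost.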
The partitioning criterion has been the time scale of the tasks, in which two levels are distinguished: pixel level and image level. For the resolutions and image sensors typical of embedded systems, algorithms that operate on pixel values do so on the order of nanoseconds. Their natural domain is hardware, where the parallelism of the operations and the flexibility of the FPGA architecture can be exploited to achieve real-time processing. Image-level tasks, on the order of milliseconds, are better executed in software. Within the designed digital system, this thesis develops solutions for the hardware implementation of the two most relevant pixel-level tasks: skin color segmentation and two-dimensional convolution. In particular, for convolution, the most computationally intensive step, architectures are proposed both for performing the operations involved in its computation and for the temporary storage of the data. The results obtained in the different test campaigns demonstrate both the effectiveness of the proposed solution to the computer vision problem and the feasibility of its implementation on FPGA devices.
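A common FPGA architecture for streaming convolution pairs line buffers (the temporary data storage) with a sliding window (the compute datapath), so that each pixel is read from memory exactly once. The software model below illustrates that idea for a 3x3 kernel; it is a sketch of this general architecture, not necessarily the exact one designed in the thesis, and it skips border pixels for simplicity.

```python
from collections import deque
import numpy as np

def streaming_conv3x3(pixels, width, height, kernel):
    """Software model of a line-buffer 3x3 convolution datapath.

    Two FIFOs of length `width` hold the previous two image rows; at each
    clock (loop iteration) the 3x3 window shifts left and a new column is
    formed from the two buffers plus the live pixel stream, as an FPGA
    pipeline would do. Output is 'valid'-mode correlation.
    """
    line1 = deque([0] * width)   # buffer for row y-1
    line2 = deque([0] * width)   # buffer for row y-2
    window = [[0] * 3 for _ in range(3)]
    out = np.zeros((height - 2, width - 2))
    for i, p in enumerate(pixels):        # pixels arrive in raster order
        x, y = i % width, i // width
        # Shift window columns left, then push the new rightmost column
        for r in range(3):
            window[r][0], window[r][1] = window[r][1], window[r][2]
        window[0][2] = line2.popleft()    # pixel (x, y-2)
        window[1][2] = line1.popleft()    # pixel (x, y-1)
        window[2][2] = p                  # pixel (x, y)
        line2.append(window[1][2])        # row y-1 becomes next y-2
        line1.append(p)                   # current row becomes next y-1
        if y >= 2 and x >= 2:             # window fully inside the image
            out[y - 2, x - 2] = sum(window[r][c] * kernel[r][c]
                                    for r in range(3) for c in range(3))
    return out
```

The appeal of this structure in hardware is that the nine multiply-accumulates can run in parallel every clock cycle, while memory bandwidth stays at one pixel read per cycle regardless of kernel size in the vertical direction.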