Direct visual servoing with respect to rigid objects

Existing visual servoing techniques which do not need metric information require, on the other hand, prior knowledge about the object's shape and/or the camera's motion. In this paper, we propose a new visual servoing technique which does not require any of them. The method is direct in the sense that: the intensity value of all pixels is used (i.e. we avoid the feature extraction step which introduces errors); and that the proposed control error as well as the control law are fully based on image data (i.e. metric measures are neither required nor estimated). Besides not relying on prior information, the scheme is robust to large errors in the camera's internal parameters. We provide the theoretical proofs that the proposed task function is locally isomorphic to the camera pose, that the approach is motion- and shape-independent, and also that the derived control law ensures local asymptotic stability. Furthermore, the proposed control error allows for simple, smooth, physically valid, singularity-free path planning, which leads to a large domain of convergence for the servoing. The approach is validated through various results using objects of different shapes, large initial displacements as well as large errors in the camera's internal parameters.


I. INTRODUCTION
Visual servoing consists in controlling the motion of a robot through the feedback of images [1]. Visually servoed systems can then be viewed as regulators of an appropriate task function [2]. This article considers the task functions which can be constructed from the current and the reference images, i.e. the teach-by-showing approach. Furthermore, we focus on techniques that do not use metric information about the observed target. Unfortunately, existing methods which fall within this class require prior knowledge of the object's shape and/or of the camera's motion. Indeed, the technique proposed in e.g. [3] (as any other method which relies solely on the Essential matrix), although not requiring an explicit metric model of the object, requires a nonplanar target as well as a sufficient amount of translation to be carried out in order to avoid the degeneracies. The technique proposed in [4], although likewise not requiring metric information, is designed for planar targets only. We remark that even for image-based visual servoing approaches, e.g. [5], minimal metric knowledge (the depth distribution) is necessary to provide a stable control law [6]. The 2.5D visual servoing strategy [7] was then proposed to enlarge that domain of stability. However, it requires a coarse metric estimate of the normal vector of the planar target, in order to decide between the two possible solutions of the reconstruction. Another alternative to augment the domain of convergence is to perform a suitable path planning [8]. However, this latter method, when applied to unknown targets (our objective), also requires this coarse metric estimate to accomplish its first phase.
In this work, we propose a new visual servoing technique which does not require or estimate any metric information about the observed target. The proposed control error as well as the control law are fully based on image measurements. We provide the theoretical proof that the control law, which is extremely simple to compute, ensures local asymptotic stability for the servoing. The control error proposed in this article generalizes [4], which is designed for planar surfaces only. In fact, the proposed method is independent of the object's shape and of the camera's motion. Moreover, another generalization concerns well-established techniques such as [9], which performs a partial Euclidean reconstruction. Our projective formulation also naturally encompasses this latter solution. The theoretical proof of all of these statements is provided. In addition to these attractive generalizations, other improvements are achieved as well. The proposed control error is locally isomorphic to the camera pose, and is also injective around the equilibrium for the entire domain of rotations. The theoretical proof of this isomorphism is also provided. Furthermore, another important strength of our control error is that it allows for simple, smooth, physically valid, singularity-free path planning. This procedure can considerably enlarge the domain of convergence of the visual servoing.
Another remarkable difference between the proposed approach and those previously mentioned (except for [4]) is how image information is exploited: these latter techniques are feature-based (e.g. points, lines, circles, etc.). This means that a sparse set of carefully chosen, distinct features is first extracted in both the current and the desired images. Correspondences are established afterward based on descriptors together with a robust matching procedure. Here, we directly use the intensity value of all pixels [10]. Therefore, higher accuracy is achieved since noise is not introduced (there is no feature extraction process) and much more information is exploited. Our direct visual tracking method [11] is coherent with the proposed control law in the sense that it does not make any assumption about either the object's shape or the motion carried out by the camera. That is, the tracking procedure likewise does not require any such prior knowledge. In fact, the same set of parameters used by the tracking procedure is also used by the control law. Indeed, we strongly believe that the vision and control aspects are intrinsically coupled processes, and they are treated here as such. This represents a break with the paradigm of the vast majority of existing visual servoing techniques to date, where the feature extraction process and the control computation are formulated separately. Although conceptually appealing, this latter uncoupled, feature-based framework presents some relevant drawbacks. For example, global constraints are not easy to embed into feature correspondence algorithms [12], such as the fact that large portions of the scene move with a coherent rigid motion, or that the appearance changes due to motion of the scene relative to the lights. Attempts to impose these constraints are usually performed a posteriori within this framework. On the other hand, the rigidity of the scene and the robustness to lighting changes can be effectively incorporated within direct methods, see e.g. [13], [14].
The proposed approach is validated through various results using objects of different shapes, large initial displacements, large errors in the camera's internal parameters, as well as path planning examples.

A. Notations
Consider a 3D point projected in the reference image $I^*$ as a pixel with homogeneous coordinates $p^* \in \mathbb{P}^2$. Then, its intensity value is denoted by $I^*(p^*)$. After displacing the camera by a translation $t \in \mathbb{R}^3$ and a rotation $R \in SO(3)$, another image $I$ is acquired. That displacement can be represented in homogeneous coordinates as the matrix $\begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \in SE(3)$. The rotation is parameterized here by the axis of rotation $u \in \mathbb{R}^3 : \|u\| = 1$ and the angle of rotation $\theta$, i.e. $R = \exp([u\theta]_\times)$, where $[r]_\times$ denotes the anti-symmetric matrix associated to the vector $r$. We also follow the usual notations $\hat{v}$, $v'$, $v^\top$, $\|v\|$ to represent respectively an estimate, a modified version, the transpose, and the 2-norm of a variable $v$. Moreover, $0$ denotes a matrix of zeros of appropriate dimensions.
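As an illustration of the axis-angle parameterization above, the following minimal sketch evaluates $R = \exp([u\theta]_\times)$ through Rodrigues' closed form and stacks $R$ and $t$ into a homogeneous displacement matrix. The function names (`skew`, `rotation_from_axis_angle`, `homogeneous`) are hypothetical helpers chosen for this example only and are not taken from the paper.

```python
import math


def skew(r):
    """Anti-symmetric matrix [r]x, such that [r]x v equals the cross product r x v."""
    x, y, z = r
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_from_axis_angle(u, theta):
    """Rodrigues' formula, the closed form of exp([u*theta]x) for a unit axis u."""
    norm = math.sqrt(sum(c * c for c in u))
    if abs(norm - 1.0) > 1e-9:
        raise ValueError("the rotation axis u must have unit norm")
    s = skew(u)
    s2 = matmul(s, s)
    a, b = math.sin(theta), 1.0 - math.cos(theta)
    return [[(1.0 if i == j else 0.0) + a * s[i][j] + b * s2[i][j]
             for j in range(3)] for i in range(3)]


def homogeneous(R, t):
    """Stack a rotation R and a translation t into a 4x4 displacement matrix."""
    return [list(R[0]) + [t[0]],
            list(R[1]) + [t[1]],
            list(R[2]) + [t[2]],
            [0.0, 0.0, 0.0, 1.0]]
```

For instance, `rotation_from_axis_angle([0.0, 0.0, 1.0], math.pi / 2)` yields a quarter turn about the optical axis, up to floating-point rounding.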

B. Two-view projective geometry
Projective geometry is an extension of Euclidean geometry which describes a larger class of transformations than just rotations and translations, including in particular the perspective projection performed by a camera [15]. In this general framework, corresponding image points $p \leftrightarrow p^*$ are related by Eq. (2), where the morphism $G \in SL(3)$ includes the homography at infinity $G_\infty$ and the epipole $e_p \in \mathbb{R}^3$ in $I$, while $\rho^* \in \mathbb{R}$ is the projective parallax with respect to a (in general) virtual plane represented in the image $I^*$ by the vector $q^* \in \mathbb{R}^3$. Indeed, a homogeneous element of the underlying Lie group is represented here by $Q$. From Eq. (2), a warping operator $w(\cdot\,; Q, \rho^*) : \mathbb{P}^2 \to \mathbb{P}^2$ can thus be defined.

C. Direct visual servoing w.r.t. planar objects
The homography-based 2D visual servoing technique proposed in [4] aims to control the motion of a camera with respect to a planar object. For this, consider that this object lies on the plane $\Pi$, which is not reconstructed either off-line or during the servoing. The method is in fact based on the recovery of the projective homography $G_\Pi$ induced by this plane between two views. Indeed, given an estimate of the camera's internal parameters $K$, $G_\Pi$ can be transformed into a Euclidean homography $H_\Pi$, and a chosen image point (also called control point) can be normalized into $m^*$ (6). The task function $e_\Pi = [e_{\nu\Pi}^\top, e_{\omega\Pi}^\top]^\top \in \mathbb{R}^6$ defined in (7) is then proven [4] to be locally isomorphic to the camera pose. However, it is non-injective around the equilibrium if the entire domain of rotations is considered, since both $\theta = 0$ and $\theta = \pi$ are mapped to $e_{\omega\Pi} = 0$.
Remark 1. It is easy to verify that $G_\Pi$ is a particular case of the general morphism $G$ in (2) for planar targets. In this case, the 3-vector $q^*$ is the representation of the plane $\Pi = [n^{*\top}, -d^*]^\top$ in the pixel coordinate system, i.e. $q^* = K^{-\top} n^*$. Of course, given that $p^*$ is the projection of a 3D point belonging to $\Pi$, the parallax of $p^*$ w.r.t. $\Pi$ is $\rho^* = 0$.
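The passage from pixel to normalized quantities described in this subsection can be sketched as follows. This is a minimal example that assumes the usual relations $H_\Pi = K^{-1} G_\Pi K$ and $m^* = K^{-1} p^*$, with a zero-skew intrinsic matrix built from focal lengths $f_u, f_v$ and a principal point $(u_0, v_0)$. These relations, the zero-skew model and all function names are assumptions made for illustration, not a transcription of the paper's equations.

```python
def intrinsics(fu, fv, u0, v0):
    """Zero-skew intrinsic matrix K (an assumed camera model)."""
    return [[fu, 0.0, u0], [0.0, fv, v0], [0.0, 0.0, 1.0]]


def inverse_intrinsics(fu, fv, u0, v0):
    """Closed-form inverse of the zero-skew K above."""
    return [[1.0 / fu, 0.0, -u0 / fu],
            [0.0, 1.0 / fv, -v0 / fv],
            [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, x):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]


def euclidean_homography(G, fu, fv, u0, v0):
    """Assumed relation H = K^-1 G K, evaluated with the estimated intrinsics."""
    K = intrinsics(fu, fv, u0, v0)
    K_inv = inverse_intrinsics(fu, fv, u0, v0)
    return matmul(matmul(K_inv, G), K)


def normalize_point(p, fu, fv, u0, v0):
    """Assumed relation m = K^-1 p for a homogeneous pixel p = (u, v, 1)."""
    return matvec(inverse_intrinsics(fu, fv, u0, v0), p)
```

With exact intrinsics and a pure rotation $R$, the projective homography $K R K^{-1}$ is mapped back to $R$; with erroneous intrinsics the result is only similar to $R$, which is why robustness to calibration errors matters for the task function.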

III. THE GENERAL DIRECT VISUAL SERVOING
This section presents a unified framework where the object can be either planar or non-planar, regardless of the motion carried out by the camera. Moreover, the proposed task function generalizes previous well-established formulations.

A. The unified direct visual tracking
In order to recover the parameters which relate the projection of an object between two views, the Fundamental matrix could be estimated [15]. However, it is not defined for planar scenes, and if the camera undergoes a pure rotation the estimation of the translation is degenerate. Hence, a robust method, e.g. [16], to detect those degeneracies and to switch between models (affine Fundamental matrix, homography, etc.) has to be used. As an alternative to these feature-based, multiple-hypothesis testing methods, we use a more general visual tracking technique which exploits all image information within a region of the reference image $\mathcal{R}^* \subseteq I^*$ [11]. The assumption of this tracking procedure is that the camera observes a continuous textured surface. In this case, an efficient second-order optimization technique is applied to minimize directly the intensity discrepancies between the current and the reference images. That is, we seek the parameters $Q$ and $\rho^*$ (see Subsection II-B) that minimize a cost function of these intensity discrepancies over $\mathcal{R}^*$. The obtained set of parameters can be used as an initial estimate for the same procedure when a new image is acquired.
With regard to those degeneracies, a conservative solution to recover these parameters is applied in [11]. That is, for every new image an initial attempt to explain the image differences is performed by using $Q$ only. Afterward, the remainder is corrected with the parallaxes $\rho^*$. Here, we adapt the technique proposed in [13] to our projective formulation. Thus, $\rho^*$ is used to explain the image discrepancies (and, whenever used, it is estimated simultaneously with $Q$) if and only if the difference between the cost values resulting from the (image) optimizations exceeds the image noise. This minimal parameterization framework presents many strengths. First, in the case that $Q$ has already explained most of those discrepancies, including the parallaxes afterward would perturb the estimate of $Q$ in the next iteration. Furthermore, once the optimal parallaxes are obtained, there is no reason to maintain them in the minimization procedure since we deal here with rigid objects only.
Remark 2. The tracking method does not use any prior information either about the object's shape or about the camera's motion. In fact, adapting the strategy proposed in [13] to our context leaves the initial parallax values $\rho^*_0 = 0$ unaltered either if the object is planar or if the camera undergoes a pure rotation. This happens because, in these cases, $Q$ alone explains all image discrepancies.
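To make the idea of minimizing intensity discrepancies concrete, the sketch below evaluates a sum-of-squared-differences cost between a reference region and the current image warped by a homography only (i.e. assuming zero parallax). The choice of a squared-difference cost with a 1/2 factor, the nearest-neighbour sampling, the omission of the parallax term and the function names are all assumptions of this example; the actual tracker in [11] uses a second-order optimization that is not reproduced here.

```python
def warp_point(G, p):
    """Apply a 3x3 homography G to a pixel p = (u, v) and dehomogenize."""
    u, v = p
    x = G[0][0] * u + G[0][1] * v + G[0][2]
    y = G[1][0] * u + G[1][1] * v + G[1][2]
    w = G[2][0] * u + G[2][1] * v + G[2][2]
    return x / w, y / w


def ssd_cost(ref, cur, G, region):
    """Half the sum of squared intensity differences over the reference region.

    ref, cur: images as nested lists indexed [row][col].
    region:   iterable of reference pixels (u, v) = (col, row).
    Pixels warped outside the current image are skipped (a simplification).
    """
    total = 0.0
    for (u, v) in region:
        x, y = warp_point(G, (u, v))
        col, row = int(round(x)), int(round(y))  # nearest-neighbour sampling
        if 0 <= row < len(cur) and 0 <= col < len(cur[0]):
            d = cur[row][col] - ref[v][u]
            total += d * d
    return 0.5 * total
```

A tracker would search for the warp parameters that drive this cost toward the image-noise level; here the cost is only evaluated for a given warp.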
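The decision rule described in this subsection (use the parallaxes only when they reduce the cost by more than the image noise) can be summarized by the following minimal sketch. The function name, the string labels and the way the noise level is supplied are hypothetical; how the two cost values are obtained is left to the tracker.

```python
def select_parameters(cost_q_only, cost_q_and_rho, image_noise):
    """Decide which parameter set explains the image discrepancies.

    cost_q_only:    cost reached when optimizing Q alone.
    cost_q_and_rho: cost reached when optimizing Q and the parallaxes jointly.
    image_noise:    threshold standing for the image noise level (assumed given).
    Returns "Q+rho" only if the parallaxes improve the cost beyond the noise.
    """
    improvement = cost_q_only - cost_q_and_rho
    return "Q+rho" if improvement > image_noise else "Q"
```

Under this rule, a planar object or a purely rotating camera yields no significant improvement, so the parallaxes keep their initial zero values, as stated in Remark 2.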

B. The proposed task function
From the parameters $Q$ and $\rho^*$ estimated by the visual tracking method (see Subsection III-A), we propose in this subsection a suitable task function for positioning the robot from an initial pose to the reference (desired) one. Indeed, Eq. (3) together with an estimate of the camera's internal parameters $K$ yields two entities, one of which is $H$ (9). These entities together with the control point $m^*$ (6) will be used for defining the translational part of the task function.
Next, define from $H$ (9) both a "projective" axis of rotation $\mu \in \mathbb{R}^3$ (which does not necessarily have unit norm) and a "projective" angle of rotation, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix and the function $\mathrm{sat} : \mathbb{R}^* \to [0, 1]$ is defined so that $\arcsin(\cdot)$ is a real-valued function. Of course, if µ = 0 then µ can be chosen arbitrarily.

Theorem 1 (Task function and isomorphism). The task function (14) is locally isomorphic to the camera pose. Moreover, it is injective around the equilibrium for the entire domain of rotations, since only $\theta = 0$ is mapped to $e_\omega = 0$.
Proof. The proof of this isomorphism is presented in [17].

Remark 3. A very important note about the task function defined in (14) is that it is constructed from projective entities only, i.e. without measuring or requiring any metric information about the object. This also means that robustness to errors in the camera's internal parameters is achieved.
Remark 4. Since the epipole is computed in the tracking process, we could use it alone to construct a decoupled translation error (15) instead of the one defined in (14). The translation error (15) is decoupled from the rotational motion since normalizing the epipole gives $K^{-1} e_p = K^{-1} K t = t$ (neglecting errors in $K$). However, if the object is planar then one cannot be sure that the recovered $e_p$ corresponds to the true solution (because more than one admissible solution exists). Nevertheless, the coupling in (14) is not a major concern for the stability because a path planning is performed (see Section IV). In addition to the possible modification (15), we could also have defined the rotational error differently, so as to replace $H_\Pi - H_\Pi^\top$ as defined in (7) within our general, unified framework. However, remarkable improvements are achieved through $e_\omega$ as defined in (14), which are stated in Corollary 1.

Corollary 1 (Generality and improvements). The proposed task function $e = [e_\nu^\top, e_\omega^\top]^\top \in \mathbb{R}^6$ in (14) is a generalization of the one $e_\Pi = [e_{\nu\Pi}^\top, e_{\omega\Pi}^\top]^\top \in \mathbb{R}^6$ defined in (7) for coping with objects of arbitrary shape. Moreover, the proposed control error allows for a straightforward path planning (shown in Section IV). Furthermore, our projective formulation naturally encompasses the hybrid control error $e_\Pi = [e_{\nu\Pi}^\top, e_{\omega\Pi}^\top]^\top = [(\alpha m - m^*)^\top, \theta u^\top]^\top \in \mathbb{R}^6$, $\alpha > 0$, proposed in [9], which requires a coarse metric estimate of the normal vector of the planar target for recovering $e_{\omega\Pi}$.
Proof. The proof of these statements is presented in [17].

Theorem 2 (Local stability). The control law (17), where $v = [\nu^\top, \omega^\top]^\top \in \mathbb{R}^6$ comprises the translational and rotational velocities and the control error $e = [e_\nu^\top, e_\omega^\top]^\top \in \mathbb{R}^6$ is as defined in (14), ensures local asymptotic stability provided that the point $m^*$ is chosen such that its parallax relative to the dominant plane of the object is sufficiently small.
Proof. The proof is presented in [17].
With respect to the parallax condition in Theorem 2, there always exists a point $m^*$ which has zero parallax (and thus can be chosen as the control point) since, in the formulation, the dominant plane¹ always crosses the object. Therefore, the closed-loop system is always locally asymptotically stable. However, for robustness reasons, it is convenient to choose a point close to the center of the object.
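As a rough illustration of how simple such a control law can be to compute, the sketch below assumes a proportional law with separate scalar gains on the translational and rotational parts of the control error, i.e. $\nu = -\lambda_\nu e_\nu$ and $\omega = -\lambda_\omega e_\omega$. This specific form is an assumption made for the example (it matches the gains $\lambda_\nu$, $\lambda_\omega$ mentioned in the results but is not a transcription of Eq. (17)), and the function name is hypothetical.

```python
def control_law(e, lambda_nu, lambda_omega):
    """Assumed proportional law: v = -(lambda_nu * e_nu, lambda_omega * e_omega).

    e: control error as a list of 6 values, translational part first.
    Returns the velocity screw [nu_x, nu_y, nu_z, omega_x, omega_y, omega_z].
    """
    if len(e) != 6:
        raise ValueError("the control error must have 6 components")
    nu = [-lambda_nu * c for c in e[:3]]
    omega = [-lambda_omega * c for c in e[3:]]
    return nu + omega
```

Setting both gains to 1 corresponds to the configuration used without path planning in the results.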

IV. PATH PLANNING
Although the method is robust to large camera calibration errors, it is desirable that the trajectory of the control point in the image be as close as possible to a straight line. With this, a large domain of convergence for the visual servoing is achieved since we enforce that at least such a point always remains in the image. For this, instead of driving $e(t) \to 0$, an appropriate path tracking $e(t) \to e^*(t)$ is performed. This is accomplished by regulating a time-varying control error $e'(t) = e(t) - e^*(t)$.
By abuse of notation, we represent in this section $e^*$ as the desired control error to be achieved, instead of a value defined w.r.t. the reference frame as throughout the article. The strategy presented in this section is different from [8], which is composed of three phases and requires a coarse metric estimate of the normal vector of the planar target. Indeed, a simple strategy is shown to be sufficient to attain our purposes, i.e. without requiring any metric information and being independent of the object's shape and of the camera's motion. This is due to the properties of the proposed control error (14):
• we need to plan the trajectory of only one point, which means that physically valid camera situations are always specified;
• the projective axis-angle parameterization already provides for a smooth trajectory;
• given the isomorphism, there are no singularities or local minima in the large.
¹ If the object is not planar, this plane is virtual.
Thus, a linear desired path $e^*(t) = [e_\nu^{*\top}(t), e_\omega^{*\top}(t)]^\top$, $\forall t \in [0, T]$, such that $e^*(0) = e(0)$ and $e^*(T) = 0$, can be easily constructed (19). Nevertheless, motivated by the fact (20) (see the results from Lemma 1 in [17]), which means that if $t = 0$ then geodesic rotations will be induced, the rotational part of (19) is slightly changed into a form where the notation $e_\omega(t - 1)$ refers to the last value of $e_\omega$. Therefore, considering a motionless target and aiming to regulate (18), the control law (17) is transformed into one where the feed-forward term $\partial e^*(t)/\partial t$ allows compensation of the tracking error, and where the adaptive gain matrix (24), also motivated by (20), is such that $\lambda_\omega(t)$ is small for large $\|e_\nu(t)\|$ and $\lambda_\omega(t) \to \lambda$ as $\|e_\nu(t)\| \to 0$.
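The path planning ingredients described above can be sketched as follows. The linear path $e^*(t) = (1 - t/T)\, e(0)$ follows directly from the boundary conditions $e^*(0) = e(0)$ and $e^*(T) = 0$; the exponential form of the adaptive rotational gain, $\lambda_\omega = \lambda \exp(-\gamma \|e_\nu\|)$, is only one possible choice satisfying the stated behaviour and is an assumption of this example, as are the function names. The modification of the rotational part of the path is not covered by this sketch.

```python
import math


def linear_path(e0, t, T):
    """Linear desired error e*(t) = (1 - t/T) e(0), with e*(0) = e(0), e*(T) = 0."""
    if not 0.0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    s = 1.0 - t / T
    return [s * c for c in e0]


def path_derivative(e0, T):
    """Time derivative of the linear path, usable as a feed-forward term."""
    return [-c / T for c in e0]


def tracking_error(e, e_star):
    """Time-varying control error e'(t) = e(t) - e*(t)."""
    return [a - b for a, b in zip(e, e_star)]


def adaptive_rotational_gain(e_nu, lam, gamma):
    """Assumed gain: small for large ||e_nu||, tends to lam as ||e_nu|| -> 0."""
    norm = math.sqrt(sum(c * c for c in e_nu))
    return lam * math.exp(-gamma * norm)
```

In a servo loop one would, at each time step, evaluate `linear_path`, form the tracking error against the measured control error, and apply the gains to it together with the feed-forward term.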

V. RESULTS
In this section, we report a diverse set of results to validate our direct visual servoing technique. We use the word "direct" to express both that there is no feature extraction process, and that the control error and control law are computed using only image measurements. For their computation, all pixels within an area of interest are exploited. For all results, this area is delimited by a red grid. The visual servoing task consists in positioning the camera with respect to a rigid object independently of its shape. For this, a reference image is stored at the reference (desired) pose. After displacing the camera (whilst tracking the object in the image) to another pose, the objective is then to drive the camera back to this desired pose. It should be noted that the technique is also independent of the displacement between the initial and desired poses, i.e. it may comprise pure translations, pure rotations or a combination of both. In order to have a real ground truth, we constructed synthetic objects of different shapes and, to simulate realistic situations as closely as possible, real textured images are mapped onto them. For all images shown here, blue marks are used to depict the motion of the control point (m* in Eq. (14)) in the image plane, whilst its planned path is projected in green. This latter is typically composed of 1000 points with an adaptive gain (24) with λ = γ = 10.

Fig. 1. Direct visual servoing with respect to a planar object from an initial pose which is different from the desired one by large rotations and translations. Bottom: evolution of the control errors (plotted components: evx, evy, evz, ewx, ewy, ewz) by using the proposed task function whilst, on the right, by using (7). Observe the rapidly increasing rotational error for the existing method, before converging to zero.
In Fig. 1, it is shown that the proposed method can cope with planar objects, thus generalizing the previously proposed task function (7), which is designed for this particular surface. The control law is stable: both translational and rotational velocities converge to zero. At convergence, the visual information coincides with the reference image, and the camera is positioned at the desired pose very accurately. Errors of less than 1 mm for the translation and less than 0.1° for the rotation are achieved. Figure 1 also shows the evolution of the Cartesian displacement (in meters and in radians). The blue marks in the reference image depict the straight line followed by the control point, as desired. In this same set of results, a comparison of the behavior of the control errors between the two methods (defined by (7) and ours) can be seen in the bottom plots. We can observe that a rapidly increasing rotational error is obtained by the existing technique. This behavior may lead to failure of the system. Yet another improvement concerns a positioning task for a rotation of θ = 180°, which cannot be performed by the existing method. The results achieved by using the proposed task function for this case are presented in Fig. 2, without any path planning, i.e. using Eq. (17) with $\lambda_\nu = \lambda_\omega = 1$. Since a pure rotation is given, the result will be the same regardless of the shape of the object, and so we use a sphere. Nevertheless, it is shown in Fig. 3 that the technique can also deal with this non-planar object under rotational and translational displacements ($\theta_0 = 84^\circ$ and $\|t_0\| = 0.72$ m), without using any prior knowledge either about the object's shape or about the camera's motion. Once again, observe that the control point follows a straight line in the image.
In the last set of results, we set up a challenging scenario: the object was a hyperbolic paraboloid (the horse's saddle); the focal lengths used were almost double the true ones, i.e. instead of $f_u = f_v = 500$ we used $f_u = 900$ and $f_v = 800$; and a large initial displacement ($\theta_0 = 162^\circ$ and $\|t_0\| = 0.3$ m) was carried out. Even in this case, the visual servoing was performed successfully. See Fig. 4 for the corresponding results. This demonstrates that the proposed technique also copes with non-planar objects, that the strategy is robust to large errors in the camera's internal parameters, and that the servoing has a very large domain of convergence.

VI. CONCLUSIONS
In this paper, we have proposed a new approach to vision-based control that does not require or estimate any metric information. Our general technique is independent of the object's shape and of the camera's motion. Thus, we do not rely on prior knowledge (leading to system flexibility), and we achieve robustness to errors in the camera's internal parameters. The sole requirement is that the object has to be rigid. In addition, the visual tracking exploits all image information without any feature extraction process, allowing us to attain high levels of accuracy for the positioning whilst being computationally efficient. This latter strength is also important since real-time performance is always a major concern in robotic systems. Finally, the proposed control law ensures a very large domain of convergence for the visual servoing due to a straightforward path planning. Hence, visual servoing tasks can be performed despite large initial displacements.

Fig. 2. Direct visual servoing with respect to a sphere from an initial pose differing from the desired one by a pure rotation (θ = 180°).

Fig. 3. Direct visual servoing with respect to a sphere from an initial pose differing from the desired one by rotations and translations. Path planning is composed of only 100 points.

Fig. 4. Direct visual servoing with respect to a hyperbolic paraboloid (also called horse's saddle) using very poor camera internal parameters (almost double the true focal lengths), and large displacements between the initial and the desired poses.