Rotation parameterizations in neural networks: some options


Representing rotation is a pain point in computer vision and robotics. This post will explain some recent rotation parameterizations from the academic literature.

Rotations in \(\mathbb{R}^3\), briefly

A rotation can be represented as a \(3 \times 3\) matrix (Direction Cosine Matrix, or DCM), \(R\) in this post. Since that matrix is in the Special Orthogonal group \(SO(3)\), it has to be orthogonal. By orthogonality, I mean

\[R^TR = I_3, \: R^{-1} = R^T, \: R R^{T} = I_3 \label{orthogonality}\]

The individual columns (and rows) of matrices in the orthogonal group \(O(n)\) are orthonormal: mutually orthogonal and of unit length (Wiki: Orthogonal group). Members of \(SO(3)\), the rotations, must have a determinant equal to +1. A determinant equal to -1 means the transformation includes a reflection and is not a rotation (Wiki: Orthogonal group/reflection).
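As a quick illustration of the two conditions, here is a minimal NumPy sketch that tests whether a matrix is numerically in \(SO(3)\). The function name `is_rotation_matrix` and the tolerance are my own choices for this example, not something from the papers discussed in this post.

```python
import numpy as np

def is_rotation_matrix(R, tol=1e-9):
    """Return True if R is (numerically) in SO(3): orthogonal with determinant +1."""
    R = np.asarray(R, dtype=float)
    if R.shape != (3, 3):
        return False
    # Orthogonality: R^T R should be the identity.
    orthogonal = np.allclose(R.T @ R, np.eye(3), atol=tol)
    # Determinant +1 separates rotations from reflections (determinant -1).
    return bool(orthogonal and np.isclose(np.linalg.det(R), 1.0, atol=tol))

# A rotation of 90 degrees about the z-axis.
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])

# A reflection across the xy-plane: orthogonal, but determinant -1.
F = np.diag([1.0, 1.0, -1.0])

print(is_rotation_matrix(Rz), is_rotation_matrix(F))
```

Here `Rz` passes both checks, while `F` is orthogonal but fails the determinant check.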

The requirement that \(R \in SO(3)\) becomes problematic when a rotation is part of a cost function to be minimized in a robotics or computer vision problem. From Levinson et al. 2020 Levinson2020:

Optimization on \(SO(3)\), and more generally on Riemannian manifolds, is a well-studied problem. Peculiarities arise since \(SO(3)\) is not topologically homeomorphic to any subset of 4D Euclidean space, so any parameterization in four or fewer dimensions will be discontinuous (this applies to all classic representations—Euler angles, axis-angle, and unit quaternions). Discontinuities and singularities are a particular nuisance for classic gradient-based optimization on the manifold [37, 41].

This post will summarize an exchange I had with Eric Brachmann and Krishna Murthy on Twitter, concerning rotation representations in neural networks. I do not claim that this post will include all of the parameterizations ever used; you can let me know of more, different ones.

First paper: unconstrained exponential coordinates

Trevor Avant and Kristi A. Morgansen AM2022, in “On the sensitivity of pose estimation neural networks: rotation parameterizations, Lipschitz constants, and provable bounds”, derive upper bounds on the sensitivity of pose estimation networks for two rotation parameterizations. One of the parameterizations is the quaternion, which is frequently used in computer vision and robotics. The other is the rotation vector: a unit rotation axis vector multiplied by the rotation angle. The Avant and Morgansen 2022 paper terms the rotation vector parameterization “unconstrained exponential coordinates”, which was new terminology for me.
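To make the rotation vector concrete, here is a minimal sketch that converts a rotation vector to a rotation matrix using Rodrigues' formula. This is a standard conversion that I am using for illustration; it is not code from the Avant and Morgansen paper, and the function name `rotvec_to_matrix` and the small-angle threshold are assumptions of this example.

```python
import numpy as np

def rotvec_to_matrix(w):
    """Map a rotation vector w (unit axis * angle in radians) to a 3x3 rotation matrix."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)  # rotation angle
    # Skew-symmetric matrix K such that K @ v == np.cross(w, v).
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    if theta < 1e-8:
        # Small-angle limit: sin(t)/t -> 1 and (1 - cos(t))/t^2 -> 1/2.
        return np.eye(3) + K + 0.5 * (K @ K)
    a = np.sin(theta) / theta
    b = (1.0 - np.cos(theta)) / theta**2
    # Rodrigues' formula, written for the unnormalized vector w.
    return np.eye(3) + a * K + b * (K @ K)

# 90 degrees about the z-axis.
R = rotvec_to_matrix([0.0, 0.0, np.pi / 2])
```

Any vector in \(\mathbb{R}^3\) maps to a valid rotation matrix this way, with no normalization constraint on the input, which is presumably where “unconstrained” comes from.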

Second and third papers: 6d vectors

In “On the Continuity of Rotation Representations in Neural Networks”, Zhou et al. 2019 Zhou2019 introduced a 6d parameter vector as a rotation parameterization, which is converted to the DCM representation (there is also a 5d vector version, which I will skip for now). Eric pointed out that this 6d-parameter representation was also used by Wang et al. 2021 Wang2021.

The main idea in this parameterization is to take two non-zero, non-parallel 3d vectors, normalize the first, make the second a unit vector orthogonal to the first (so that their dot product is 0), and use the cross product to find the third vector that fills out a \(3 \times 3\) rotation matrix (DCM). This is a Gram-Schmidt-like orthogonalization.

The 6d rotation parameterization is succinctly explained in Wang et al. 2021:

Specifically, the 6-dimensional parameterization \(\mathbf{R}_{\text{6d}}\) is defined as the first two columns of \(\mathbf{R}\)

\[\mathbf{R}_{\text{6d}} = [ \mathbf{R}_{\cdot 1} \vert \mathbf{R}_{\cdot 2} ] \label{sixd}.\]

The rotation matrix \(\mathbf{R} = [ \mathbf{R}_{\cdot 1} \vert \mathbf{R}_{\cdot 2} \vert \mathbf{R}_{\cdot 3} ]\) can be computed according to

\[\begin{cases} \mathbf{R}_{\boldsymbol{\cdot}1} = \phi(\mathbf{r}_1) \\ \mathbf{R}_{\boldsymbol{\cdot}3} = \phi(\mathbf{R}_{\boldsymbol{\cdot}1} \times \mathbf{r}_2) \\ \mathbf{R}_{\boldsymbol{\cdot}2} = \mathbf{R}_{\boldsymbol{\cdot}3} \times \mathbf{R}_{\boldsymbol{\cdot}1} \\ \end{cases}, \label{eq:r6_to_rot}\]

where \(\phi(\bullet)\) denotes the vector normalization operation.
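The equation above translates almost line by line into NumPy. The following is a minimal sketch, assuming \(\mathbf{r}_1\) and \(\mathbf{r}_2\) are non-zero and not parallel; the names `normalize` and `r6d_to_matrix` are mine and are not taken from the code released with either paper.

```python
import numpy as np

def normalize(v):
    """phi(.) in the equation above: scale v to unit length."""
    return v / np.linalg.norm(v)

def r6d_to_matrix(r1, r2):
    """Build a 3x3 rotation matrix from two 3d vectors (the 6d parameterization)."""
    r1 = np.asarray(r1, dtype=float)
    r2 = np.asarray(r2, dtype=float)
    c1 = normalize(r1)                # first column
    c3 = normalize(np.cross(c1, r2))  # third column, orthogonal to c1 and r2
    c2 = np.cross(c3, c1)             # second column, completes a right-handed frame
    return np.column_stack([c1, c2, c3])

# The inputs need not be unit length or orthogonal to each other.
R = r6d_to_matrix([2.0, 0.0, 0.0], [1.0, 3.0, 0.0])
```

For these inputs the result is the identity matrix: the first vector fixes the x-axis, and the part of the second vector orthogonal to it fixes the y-axis.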

Fourth and fifth papers: SVD

Other works, “An Analysis of SVD for Deep Rotation Estimation” Levinson2020 and “Wide-Baseline Relative Camera Pose Estimation with Directional Learning” Chen2021, use a 9-dimensional rotation parameterization and then convert this 9d representation to a rotation matrix using the solution to the Orthogonal Procrustes problem.

The process is as follows: the 9 parameters are arranged into a \(3 \times 3\) matrix \(\mathbf{A}\), and the singular value decomposition (SVD) is applied to it, \(\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T\).

Then the rotation matrix representation is

\[\mathbf{R} = \mathbf{U} \operatorname{diag}(1, 1, \det(\mathbf{U} \mathbf{V}^T)) \mathbf{V}^T, \label{so-ortho}\]

which in Levinson et al. 2020 Levinson2020 is called the special orthogonalization \(SVDO^+()\) and maps to the special orthogonal group \(SO(3)\). This result is due to Kabsch 1976, 1978 Kabsch1976, Kabsch1978.

Usually, the solution to the orthogonal Procrustes problem is given as

\[\mathbf{R} = \mathbf{U} \mathbf{V}^T \label{procrustes}\]

(Schönemann 1966 Sch1966, Golub and Van Loan 2013 GVL2013); in Levinson2020, the \(\mathbf{R} = \mathbf{U} \mathbf{V}^T\) solution is denoted \(SVDO()\) and maps to the orthogonal group \(O(n)\).
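Both orthogonalizations are only a few lines with `numpy.linalg.svd`. The sketch below is my own minimal version for the \(3 \times 3\) case, not the implementation released with Levinson2020 or Chen2021; the function names `svdo` and `svdo_plus` are chosen here to mirror the paper's notation.

```python
import numpy as np

def svdo(A):
    """Orthogonal Procrustes solution U V^T; the result is in O(3)."""
    U, _, Vt = np.linalg.svd(A)  # A = U @ diag(s) @ Vt
    return U @ Vt

def svdo_plus(A):
    """Special orthogonalization U diag(1, 1, det(U V^T)) V^T; the result is in SO(3)."""
    U, _, Vt = np.linalg.svd(A)
    d = np.linalg.det(U @ Vt)    # +1 or -1, up to rounding
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# An input with negative determinant: the two solutions differ.
A = np.diag([1.0, 1.0, -2.0])
print(np.linalg.det(svdo(A)), np.linalg.det(svdo_plus(A)))
```

In a network, `A` would be the 9 raw outputs reshaped to \(3 \times 3\); deep learning frameworks provide differentiable SVD routines, so the same two lines can sit inside the model.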

Orthogonal group solution versus Special Orthogonal Group solution

What’s the difference between \(O(3)\) and \(SO(3)\), and between the related solutions using SVD? Every matrix in \(O(3)\) has determinant +1 or -1. \(SO(3)\) is the subgroup of \(O(3)\) whose members have determinant +1. An \(O(3)\) transformation with determinant -1 includes a reflection and is not a rotation (Wiki: orthogonal group).

The rotation matrix solution \(\mathbf{R} = \mathbf{U} \operatorname{diag}(1, 1, \det(\mathbf{U} \mathbf{V}^T)) \mathbf{V}^T\) sets the sign of the third diagonal entry, the one paired with the smallest singular value, such that \(\mathbf{R}\) has determinant +1. Kabsch’s papers and the Wikipedia article have additional details.
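To see why the determinant comes out as +1, write \(d = \det(\mathbf{U} \mathbf{V}^T)\). Since \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal, \(d\) is either +1 or -1. A short check, using only the product rule for determinants:

\[\det(\mathbf{R}) = \det(\mathbf{U}) \, \det(\operatorname{diag}(1, 1, d)) \, \det(\mathbf{V}^T) = d \, \det(\mathbf{U} \mathbf{V}^T) = d^2 = 1.\]

When \(d = +1\), the correction does nothing and \(\mathbf{R} = \mathbf{U} \mathbf{V}^T\), so the two solutions coincide.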

Orientation-preserving transformations: depends on the context

In the Levinson et al. 2020 Levinson2020 paper, as mentioned above, there are two solutions to SVD orthogonalization (again, with \(\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T\) being the SVD):

\[SVDO(\mathbf{A}) = \operatorname*{arg\,min}_{\mathbf{R} \in O(n)} \| \mathbf{R} - \mathbf{A} \|_F^2 \label{SVDO}\]
\[SVDO(\mathbf{A}) = \mathbf{U} \mathbf{V}^T \label{SVDO2}\]

and

\[SVDO^+(\mathbf{A}) = \operatorname*{arg\,min}_{\mathbf{R} \in SO(n)} \| \mathbf{R} - \mathbf{A} \|_F^2 \label{SVDO-plus}\]
\[SVDO^+(\mathbf{A}) = \mathbf{U} \operatorname{diag}(1, 1, \det(\mathbf{U} \mathbf{V}^T)) \mathbf{V}^T. \label{so-ortho2}\]

I was confused by the first half of this sentence in Section 3 of that paper:

\(SVDO\) is orientation-preserving, while \(SVDO^+\) maps to \(SO(n)\).

I remembered that the Hartley and Zisserman 2004 book MVG2004, in its description of isometries, gives this definition of orientation-preserving (p.38):

If \(\epsilon = 1\) then the isometry is orientation-preserving and is a Euclidean transformation (a composition of a translation and rotation). If \(\epsilon = -1\) then the isometry reverses orientation.

I was considering ‘orientation-preserving’ within the Hartley and Zisserman 2004 context: given a matrix \(\mathbf{R}\), transform some 3D objects. Is \(\mathbf{R}\) orientation-preserving? In this context, \(SVDO^+\) is orientation-preserving, but \(SVDO\) may or may not be orientation-preserving.

However, this paper uses ‘orientation-preserving’ differently. One of the paper’s authors, Ameesh Makadia, kindly explained to me that “\(SVDO\) is orientation-preserving” means that the output of the transformation has the same orientation as its input (\(\mathbf{A}\) in this post). In other words, the determinants of \(\mathbf{A}\) and \(SVDO(\mathbf{A})\) have the same sign.
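This sign relationship follows from the SVD itself: \(\det(\mathbf{A}) = \det(\mathbf{U}) \det(\mathbf{\Sigma}) \det(\mathbf{V}^T)\), and \(\det(\mathbf{\Sigma})\) is a product of non-negative singular values, so for an invertible \(\mathbf{A}\) the sign of \(\det(\mathbf{A})\) equals \(\det(\mathbf{U} \mathbf{V}^T)\). Here is a minimal numerical check, assuming \(\mathbf{A}\) is invertible; `svdo` and `same_orientation` are names I made up for this sketch.

```python
import numpy as np

def svdo(A):
    """Orthogonal Procrustes solution U V^T from the SVD of A."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

def same_orientation(A):
    """True if det(A) and det(SVDO(A)) have the same sign (A assumed invertible)."""
    return bool(np.sign(np.linalg.det(A)) == np.sign(np.linalg.det(svdo(A))))

# An input with negative determinant stays orientation-reversing under SVDO.
A = np.diag([1.0, 1.0, -2.0])
print(same_orientation(A))
```

So \(SVDO\) keeps whatever orientation \(\mathbf{A}\) has, including a reflection, whereas \(SVDO^+\) always returns a rotation.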

References

[AM2022] Trevor Avant and Kristi A. Morgansen. “On the sensitivity of pose estimation neural networks: rotation parameterizations, Lipschitz constants, and provable bounds”. 2022. arXiv:2203.09937 [cs.CV]. doi:10.48550/arXiv.2203.09937.

[Zhou2019] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, Hao Li. “On the Continuity of Rotation Representations in Neural Networks”. CVPR 2019. arXiv:1812.07035 [cs.LG].

[Wang2021] Gu Wang, Fabian Manhardt, Federico Tombari, Xiangyang Ji. “GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation”. CVPR 2021. arXiv:2102.12145 [cs.CV].

[Levinson2020] Jake Levinson, Carlos Esteves, Kefan Chen, Noah Snavely, Angjoo Kanazawa, Afshin Rostamizadeh, Ameesh Makadia. “An Analysis of SVD for Deep Rotation Estimation”. NeurIPS 2020. arXiv:2006.14616v1 [cs.CV].

[Kabsch1976] Wolfgang Kabsch. “A solution for the best rotation to relate two sets of vectors”. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976. doi:10.1107/S0567739476001873.

[Kabsch1978] Wolfgang Kabsch. “A discussion of the solution for the best rotation to relate two sets of vectors”. Acta Crystallographica, A34(5):827–828, 1978. doi:10.1107/S0567739478001680.

[Chen2021] Kefan Chen, Noah Snavely, Ameesh Makadia. “Wide-Baseline Relative Camera Pose Estimation with Directional Learning”. CVPR 2021. arXiv:2106.03336v1 [cs.CV].

[GVL2013] Gene H. Golub and Charles F. Van Loan. Matrix Computations. 2013. ISBN 9781421407944.

[Sch1966] Peter H. Schönemann. “A generalized solution of the orthogonal Procrustes problem”. Psychometrika, 31:1–10, 1966.

[MVG2004] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision, 2nd edition. 2004.

© Amy Tabb 2018 - 2023. All rights reserved. The contents of this site reflect my personal perspectives and not those of any other entity.