Entailment Cones for Better Hierarchical Image Classification

Caroline Freyer
16 min read · Apr 16, 2021

Written by Caroline Freyer and Marios Marinos

In this blog, we explain the approach introduced by Dhall et al. [2020] to leverage the semantic hierarchy embedded in class labels. Dhall et al. empirically show the value of such external semantic information, propose several models that exploit it, and validate these models on the hierarchical ETHEC dataset. Our hope with this blog is to provide the reader with a more detailed and straightforward explanation of the concepts addressed in the paper. This explanation is based on our own research as well as our efforts to reproduce the results of the paper using the repository Dhall et al. provided.

Reproducibility is the ability to recreate a paper's results and thereby validate and establish the credibility of the proposed method. Many argue that for a paper to be scientifically sound and complete, it should be independently reproducible [Raff, 2019], i.e. without additional help from the authors. Furthermore, reproducing paper results is a difficult but essential skill that must be practised. To keep up with the fast-paced environment, it is necessary to build on the work of others, and to do this effectively, reproduction is the main step. In this blog, we hope to further the understanding of the topics addressed by Dhall et al. [2020], but also to share our experience with reproducing the paper.

In the field of machine learning, classification is an instance of supervised learning. Formally, supervised learning problems are applications in which the training data comprises examples of the input vectors with their corresponding target vectors [Bishop, 2006]. Given this labelled training data, the supervised model must learn a function that maps an input to a target vector. In classification, the target vectors are elements of a finite discrete set of categories, commonly referred to as classes. Usually, classification is performed by independently predicting the class probabilities, i.e. the probability of the input vector belonging to each of the classes, and predicting the class with the largest probability. This assumes mutually exclusive and unstructured labels. However, Dhall et al. [2020] argue that the labels in many common datasets have an underlying organisation that allows hierarchical information to be extracted. Incorporating such a hierarchy could improve classification, as it uses external guidance beyond the traditional object-label pairs used for training. The hierarchy forms a partial order over images and language, illustrated in Figure 2, which the authors refer to as the visual-semantic hierarchy. Moreover, it improves generalisation on classes for which training data is scarce by leveraging shared features, and it bridges the gap between the ways humans and machines understand concepts visually. Thus, both performance and explainability are improved. Furthermore, considering only image-based features to distinguish between objects may be insufficient, especially if the intra-class variance is higher than the inter-class variance: images that belong together in the same class end up assigned to different classes [Dhall et al., 2020]. Thus, using label-label interactions as well as object-label interactions could improve the quality of learning.

Figure 1: Sample images from the ETHEC dataset with their 4 levels [Dhall et al., 2020].

Dhall et al. propose multiple approaches for injecting hierarchical semantics into state-of-the-art CNN classifiers. These approaches fall into two categories: one where the hierarchy is exploited in the loss function, and another that embeds images and labels in a common space. The latter is based on entailment cones, which can be embedded in both Euclidean and hyperbolic spaces. All these methods are validated on the ETHEC dataset, a hierarchical entomology (insect) dataset from ETH Zurich. This dataset includes 47,978 images with 4 different levels and 723 labels. Sample images are shown in Figure 1. To help the reader visualise what is meant by a hierarchical dataset, Figure 2 illustrates a small toy dataset of three images with their hierarchical properties. The arrows in the figure represent the hierarchical relations between the different categories.

Figure 2: A slice of the visual-semantic hierarchy [Vendrov et al., 2015].

Order-embedding

Order-embeddings preserve the order between objects rather than the distances between them. This matters because physical closeness may not always indicate the same object, due to high variance. More specifically, order-embeddings are monotone functions that provide a way to include one partially ordered set into another. A partially ordered set (poset) is a set of values with an associated binary relation such as ≤. In a totally ordered set, every pair of elements must be comparable by the binary relation; for partially ordered sets this requirement is relaxed, allowing pairs in which neither element precedes the other. An example of a partially ordered set is (N+, ÷), where N+ denotes the set of natural numbers excluding zero. In this set, 3 and 5 are not comparable, as neither 3÷5 nor 5÷3 is contained in N+. On the other hand, 6 and 3 are comparable because 6÷3 = 2 ∈ N+. To differentiate between the pairs that are ordered and those that are not, Dhall et al. [2020] define the sets P and N, respectively.
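The divisibility example above can be checked in a few lines of plain Python (our illustration, not code from the paper):

```python
# Comparability in the poset (N+, ÷): two positive naturals are comparable
# exactly when one divides the other, i.e. the quotient stays in N+.
def comparable(a: int, b: int) -> bool:
    return a % b == 0 or b % a == 0

print(comparable(6, 3))  # True: 6 ÷ 3 = 2 ∈ N+
print(comparable(3, 5))  # False: neither divides the other
```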

For completeness we give the following definition of an order-embedding from Schmidt [2010]:
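The standard definition reads: given posets (X, ≤) and (Y, ⪯), a function f : X → Y is an order-embedding if

```latex
x_1 \le x_2 \iff f(x_1) \preceq f(x_2) \qquad \text{for all } x_1, x_2 \in X.
```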

Such orderings enforce transitivity and anti-symmetric relations without having to rely on the physical closeness between points. We define these concepts as follows [Schmidt, 2010]:
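For a relation ≤ on a set X, these two properties read:

```latex
\text{transitivity:}\quad a \le b \;\wedge\; b \le c \;\Rightarrow\; a \le c,
\qquad
\text{anti-symmetry:}\quad a \le b \;\wedge\; b \le a \;\Rightarrow\; a = b.
```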

An example of an order-embedding is shown in Figure 3. In this example, the function f(x) = (94x + 3)/100 maps the open interval (0,1) and the closed interval [0,1] to the subsets (0.03, 0.97) and [0.03, 0.97], respectively. Thus, f is both order-preserving and order-reflecting.

Figure 3: Mutual order embedding of the open interval (0,1) and the closed interval [0,1] using f(x) = (94x + 3)/100 in both directions. Figure by Jochen Burghardt.

Using such orderings to embed hierarchical semantics gives structure to the embedding space that is necessary to leverage the hierarchical information within the classification. The goal is to determine if an arbitrary pair in the embedded space is ordered. Dhall et al. [2020] use the following relation:
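Following Vendrov et al. [2015], this reversed product order can be written as:

```latex
y \preceq x \iff \bigwedge_{i=1}^{N} y_i \ge x_i ,
```

i.e. y sits below x in the order (y is a subconcept of x) exactly when every coordinate of y dominates the corresponding coordinate of x.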

Note the reversal of the direction in the expression. This refers to the fact that the top element of the order is the more general concept. In words, the relation defines y to be a subconcept of x if all its individual coordinates are greater than or equal to those of x. Given the ordering, the next step is to find the order-embedding. In practice, the formal definition given for an order-embedding is too restrictive. Therefore, an approximate order-embedding is used. First, a penalty term is defined to measure the degree to which a pair of points violates the order. We define this penalty, later called the Energy, as follows:
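Following Vendrov et al. [2015], adapted to the ordering convention above, the penalty is:

```latex
E(x, y) = \left\lVert \max\!\big(0,\; x - y\big) \right\rVert^{2},
```

which is zero exactly when y_i ≥ x_i for all i.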

In words, this penalty is zero when y is a valid subconcept of x, and it grows with the amount by which the coordinates of y fall below those of x. To learn the order-embedding f, we minimise the max-margin loss, which encourages positive pairs (in the set P) to have zero penalty in the embedding and negative pairs (in the set N) to have a penalty greater than a margin α. Formally, the max-margin loss is defined as
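A minimal sketch of this energy and loss in plain Python (our illustration following Vendrov et al. [2015]; the element-wise max and squared norm match the penalty above):

```python
# Order-violation energy: accumulates wherever a coordinate of y
# falls below the corresponding coordinate of x (the convention in
# the text: y is a subconcept of x when y_i >= x_i for all i).
def energy(x, y):
    return sum(max(0.0, xi - yi) ** 2 for xi, yi in zip(x, y))

def max_margin_loss(positive, negative, alpha=1.0):
    """Positive pairs should have zero energy; negatives at least alpha."""
    loss = sum(energy(x, y) for x, y in positive)
    loss += sum(max(0.0, alpha - energy(x, y)) for x, y in negative)
    return loss

print(energy([1.0, 2.0], [1.5, 2.5]))  # 0.0  (valid subconcept)
print(energy([1.0, 2.0], [0.5, 2.5]))  # 0.25 (first coordinate violates the order)
```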

Hyperbolic Geometry

Hyperbolic geometry is a non-Euclidean geometry that rejects the fifth postulate, the parallel postulate: through a point not on a given line, there is exactly one line parallel to the given line.

In hyperbolic geometry, there are at least two lines parallel to the given line that go through the given point. However, unlike Euclidean geometry where the lines are taken to be equidistant everywhere, the lines converge in one direction and diverge in another. Moreover, the sum of the angles in a triangle is less than two right angles in hyperbolic geometry. Figure 4 summarises these concepts visually.

Figure 4: Summary of the difference between hyperbolic and Euclidean geometry [A’Campo and Papadopoulos, 2012].

A model for such a geometry is the Poincaré ball. In N dimensions it is defined by the manifold

D^N = {x ∈ R^N : ‖x‖ < 1},

where the Riemannian metric is a conformal rescaling of the Euclidean metric, g_x^D = λ_x² g^E, with the conformal factor λ_x = 2/(1 − ‖x‖²).

This is the ball of radius one, equipped with the hyperbolic metric. In this case, ‖x‖_D is the hyperbolic equivalent of the standard Euclidean norm, namely the hyperbolic distance to the origin, ‖x‖_D = d_D(0, x) = 2 artanh(‖x‖).

See Figure 6 for a visual representation of the Poincaré ball with its tangent space. These are the constructs needed to define the hyperbolic cones used in the embedding classifiers. The motivation for using this geometry is that the volume of a ball grows exponentially with its radius, compared to polynomially in Euclidean geometry, so exponentially-growing hierarchies can be embedded in a low-dimensional space. Moreover, hyperbolic space models tree-like structures better, and a tree is exactly the data structure used to define the label hierarchy.

Hyperbolic Cones

Traditional order-embeddings use orthants to represent the order relation: if v is a sub-concept of u, it lies within the orthant anchored at u. However, this means each concept occupies a large volume of the embedding space, and the orthants intersect heavily. Cones were therefore used instead. Figure 5 shows the difference visually.

Figure 5: Example comparing the use of orthants and cones in the embedded space.

Hyperbolic cones are Euclidean cones (shown in Figure 5) mapped into hyperbolic geometry. Such cones generalise order-embeddings and are more flexible than the commonly known Euclidean cones. To define these cones, we introduce the exponential map at a point x in the Poincaré ball, defined as
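Following Ganea et al. [2018], for the unit Poincaré ball this map can be written as:

```latex
\exp_x(v) = x \oplus \left( \tanh\!\Big(\tfrac{\lambda_x \lVert v \rVert}{2}\Big)\, \frac{v}{\lVert v \rVert} \right),
```

where ⊕ denotes Möbius addition and λ_x = 2/(1 − ‖x‖²) is the conformal factor introduced above.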

Then we can define a cone in the target space and project it using this map. An illustration of this from Ganea et al. [2018] can be found in Figure 6.

Figure 6: Visualisation of hyperbolic cones in 3D space from [Ganea et al., 2018].

For x in the Poincaré ball, the aperture of the cone is
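Ganea et al. [2018] give this aperture as:

```latex
\psi(x) = \arcsin\!\left( K \,\frac{1 - \lVert x \rVert^{2}}{\lVert x \rVert} \right),
```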

where K is a hyper-parameter to be tuned. Note that the maximum aperture of the cone is π/2. Given these constructs, we can define the minimum angle between the axis of the cone and a vector y. This will be used to approximate the energy between x and y. We define this angle as
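In Ganea et al. [2018], this angle has the closed form (our transcription):

```latex
\Xi(x, y) = \arccos\!\left( \frac{\langle x, y \rangle \,(1 + \lVert x \rVert^{2}) - \lVert x \rVert^{2}\,(1 + \lVert y \rVert^{2})}{\lVert x \rVert \,\lVert x - y \rVert \,\sqrt{1 + \lVert x \rVert^{2}\lVert y \rVert^{2} - 2\langle x, y \rangle}} \right).
```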

Then the energy between x and y can be defined as
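With the aperture ψ and angle Ξ as above, the cone energy from Ganea et al. [2018] reads:

```latex
E(x, y) = \max\big(0,\; \Xi(x, y) - \psi(x)\big),
```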

which represents the minimum angle required to rotate the axis of the cone at x to bring y into the cone. Thus, we want this angle to be smaller than the aperture in order to minimise the energy between the two points.
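The aperture, angle, and energy can be sketched in plain Python as follows (our illustration following Ganea et al. [2018]; the value of K and the example points are arbitrary):

```python
import math

K = 0.1  # aperture hyper-parameter from the text

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def aperture(x):
    """Half-aperture psi(x) of the hyperbolic cone at x."""
    return math.asin(min(1.0, K * (1 - norm(x) ** 2) / norm(x)))

def angle(x, y):
    """Angle Xi(x, y) between the cone axis at x and the point y."""
    nx, d = norm(x), norm([a - b for a, b in zip(x, y)])
    num = dot(x, y) * (1 + nx ** 2) - nx ** 2 * (1 + norm(y) ** 2)
    den = nx * d * math.sqrt(1 + nx ** 2 * norm(y) ** 2 - 2 * dot(x, y))
    return math.acos(max(-1.0, min(1.0, num / den)))  # clamp for float safety

def energy(x, y):
    """E(x, y) = max(0, Xi(x, y) - psi(x)): zero iff y lies inside the cone at x."""
    return max(0.0, angle(x, y) - aperture(x))

# A point further along the same ray from the origin falls inside the cone:
print(energy([0.5, 0.0], [0.7, 0.0]))  # 0.0
```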

Optimisation in the hyperbolic space

For parameters in the hyperbolic space Riemannian stochastic gradient descent (RSGD) is used. This uses the update
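Following Ganea et al. [2018], this update can be written as:

```latex
u_{t+1} \leftarrow \exp_{u_t}\!\big( -\eta \,\nabla^{R}_{u_t} L \big),
```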

where the gradient is the Riemannian gradient for parameter u, η is the learning rate, and L is the order-violation loss we are trying to minimise. The Riemannian gradient is defined as a rescaling of the Euclidean gradient as follows:
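Concretely, the rescaling reads:

```latex
\nabla^{R}_{u} L = \frac{1}{\lambda_u^{2}} \,\nabla^{E}_{u} L = \frac{\big(1 - \lVert u \rVert^{2}\big)^{2}}{4} \,\nabla^{E}_{u} L,
```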

where λ_u = 2/(1 − ‖u‖^2). However, Dhall et al. [2020] noted that an approximated version of the Adam optimiser in hyperbolic space worked better than RSGD, which is exact. Nonetheless, RSGD was still used for part of the optimisation; this is further explained in subsequent sections.

Embedding-based models

Based on some defined similarity metric, these models return the closest concept in the embedding space for a given query [Dhall et al., 2020]. In the usual setting, the labels and the objects are embedded into the same space; the similarity between the object and all the labels is then calculated, and the best cross-modal match is returned. Common mappings include neural networks, and a common similarity metric is the dot product. Extending this concept into hyperbolic geometry with the use of hyperbolic cones allows for more flexibility when leveraging the label hierarchy. This is in contrast to injecting the hierarchical information into the loss function, which is very restrictive and must be tailored to a specific downstream task.

In the sections above, the general constructs were defined for the classification task addressed by Dhall et al. [2020]. The approaches considered by Dhall et al. [2020] fall into two categories: CNN-based classifiers, where the hierarchical information is injected into the loss function that is minimised, and embedding classifiers, where the labels and images are embedded in the same space. In this blog, we focus on two CNN-based classifiers, the Per-level classifier (PLC) and the Marginalisation classifier (MC), as these are the ones we reproduced. Furthermore, we describe the joint embedding classifier that uses hyperbolic cones.

CNN-based models

In this section, we address classifiers that incorporate the label hierarchies into the loss function. We first describe the Hierarchy-agnostic baseline classifier (HAB) which is used as a baseline to compare the effect of the hierarchical information. Later we describe the PLC and MC classifiers. Dhall et al. [2020] argue that these models are less flexible than the embedding classifiers as the hierarchies are typically fixed and tailored to one specific downstream task. However, we later see that the results of MC outperform those of the embedding classifiers and are thus worth considering further.

Hierarchy-agnostic baseline classifier (HAB): For the hierarchy-agnostic baseline (HAB) classifier, a residual network for image classification is used [He et al., 2016]. This classifier does not incorporate any label hierarchy in the dataset. It performs classification over the N_t labels, where N_t is the total number of labels across all L levels (Figure 8). For this classification, the multi-label soft-margin loss is minimised. This loss is defined as [Dhall et al., 2020]:
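This is the standard multi-label soft-margin loss (as implemented, for example, by PyTorch's MultiLabelSoftMarginLoss), with σ the sigmoid function:

```latex
L_{\text{MLSM}}(x, y) = -\frac{1}{N_t} \sum_{i=1}^{N_t} \Big[ y_i \log \sigma(x_i) + (1 - y_i) \log\big(1 - \sigma(x_i)\big) \Big].
```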

Here, x ∈ R^{N_t} are the logits from the last layer of the model, and y ∈ {0,1}^{N_t} is the multi-hot encoded label vector for the input, with one active entry per level of the hierarchy.

Figure 8: Model schematic for the hierarchy-agnostic classifier. The model is a multi-label classifier and does not utilize any information about the presence of an explicit hierarchy in the labels [Dhall et al., 2020].

Per-level classifier (PLC): The per-level classifier is very similar to the HAB classifier, but the single N_t-way classifier is replaced with L separate N_i-way classifiers. In PLC there is one classifier for each of the L levels, handling the N_i labels present on level i (Figure 9). The loss it minimises is the sum of the multi-label soft-margin losses of the separate classifiers, defined as follows:
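Writing x^i and y^i for the logits and labels restricted to level i, the loss can be sketched, consistent with the description above, as:

```latex
L_{\text{PLC}} = \sum_{i=1}^{L} L_{\text{MLSM}}\big(x^{i}, y^{i}\big).
```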

Dhall et al. [2020] note that although the notion of L levels is built into PLC, it is still unaware of the relationship between nodes across levels. Hence, there is a need for a classifier that can “remember” the relationship across different levels and this is where MC comes in.

Figure 9: Model schematic for the per-level classifier (L separate N_i-way classifiers). The model uses information about the label hierarchy by explicitly predicting a single label per level for a given image [Dhall et al., 2020].

Marginalization classifier (MC): The MC classifier uses a single classifier that outputs a probability distribution over the final level of the hierarchy. For the remaining L−1 levels, Dhall et al. [2020] compute the probability distribution over a level's labels by summing the probabilities of the children nodes (Figure 10). For this classifier, the loss that is minimized is
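A sketch of this loss, with v_i the ground-truth label at level i and the probability of each non-leaf node obtained by marginalising over its leaves:

```latex
L_{\text{MC}} = -\sum_{i=1}^{L} \log p(v_i), \qquad p(v_i) = \sum_{\ell \in \text{leaves}(v_i)} p(\ell).
```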

Figure 10: Model schematic for the Marginalization method. Instead of predicting a label per level, the model outputs a probability distribution over the leaves of the hierarchy. The probability for non-leaf nodes is determined by marginalizing over the direct descendants. The Marginalization method models how different nodes are connected among each other in addition to the fact that there are L levels in the label hierarchy [Dhall et al., 2020].

To summarise the above information concisely, we present Table 1.

Table 1: Summary

Order-embedding models

For this model, the labels and images are embedded into the same N-dimensional hyperbolic space. The labels are defined directly in this space, while the images are initially represented by features in the R^2048 space, extracted from the best-performing CNN-based model described above. To map them to the D^N space, a learned linear transform W ∈ R^{2048×N} is used, and the result of this transformation is mapped into D^N using the exponential map at 0 defined in Equation 1.

Next, the label hierarchy must be represented in this space. The label hierarchy is treated as a directed acyclic graph defined by the dataset X, which consists of entailment relations (u,v). Each relation represents a directed edge from concept u to concept v and conveys that v is a sub-concept of u. The model is then trained to embed these semantics into the space. During training, edges that do not exist in X, referred to as negative edges, are added to the training data to allow for a more robust embedding. To improve convergence and help better embed the hierarchy into the space, Dhall et al. [2020] restrict the negative edge generation to only form edges between different levels in the hierarchy. The Adam optimiser is used for this embedding. Once everything is embedded in the space, we can calculate the energy E between each image and each label and choose the label corresponding to the minimum energy.
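At the origin, Möbius addition drops out and the exponential map simplifies to exp_0(v) = tanh(‖v‖) · v/‖v‖, which always lands strictly inside the unit ball. A minimal sketch of this mapping step (our illustration, not code from the repository):

```python
import math

def exp0(v):
    """Exponential map at the origin of the Poincare ball:
    exp_0(v) = tanh(||v||) * v / ||v||."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(n) / n
    return [scale * c for c in v]

# Even a large feature vector is squashed strictly inside the unit ball:
mapped = exp0([3.0, 4.0])  # ||v|| = 5
print(math.sqrt(sum(c * c for c in mapped)) < 1.0)  # True
```

In the full pipeline, v would be the projected feature vector W^T f(image) rather than a hand-picked point.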

Training environment: Since training a neural network can be an intensive and heavy process, we decided to use the Google Cloud Platform to train the classifiers, as it provides graphics processing units (GPUs). A model could take approximately 30 to 35 hours to train for 100 epochs on a GPU.

Figure 7: Jointly embedding labels and images using EC in R^2. The images are accumulated around the periphery, away from the origin [Dhall et al., 2020].

Hyper-parameters: In our reproduction, we tried to reproduce the results of Table 2 of Dhall et al. [2020]; more specifically, the results for the PLC and MC models described above. The hyper-parameters used are shown in Table 2. For the MC model, the exact same parameters were used, except that the learning rate was 0.0001 instead of 0.01.

In Tables 3 and 4 we compare our reproduction results with the results in Table 2 of Dhall et al. [2020]. Our reproduction underperforms compared to theirs, which might be due to the stochasticity involved in optimising neural networks. Nonetheless, we see the same general trend across the different levels, and this trend speaks more to the consistency of the results than to their exact values. We were therefore satisfied with the reproduction despite the underperformance. We believe that training the model again (perhaps with a different initialisation) might produce better results; however, due to time constraints, we were not able to check this.

Code: When it came to diving into the code, we had quite a hard time understanding it: the files were huge and messy. Fortunately, for the most part we only had to configure the code rather than implement new features or change existing ones. Beyond that, the code was not too difficult to run after changing the data paths and making minor changes.

Our work in reproduction: Initially, the goal was to reproduce the results of three models: PLC, MC and HC. Unfortunately, we were not able to achieve this goal. When attempting to run the HC model, we found that three .npy files necessary to run it were missing from the GitHub repository. We contacted the authors asking for these files; unfortunately, they no longer had them, but they gave us a rough procedure for generating them. Due to time constraints, we were not able to implement this procedure. Still, we provide some pseudocode for how we think it should be implemented, had we had the required time.

The dataset, and therefore the inputs that the model receives on line 9 (model.forward(dataset)), consists of the images of the ETHEC dataset (224 × 224 RGB). The best_weights.pty file contains the weights of the best-performing CNN model, in our case the MC model; this step avoids having to retrain the MC model. The output of the pseudocode is three .npy files (train, test, val), which should be dictionaries with the image name (e.g. ETHZ_ENT01_2017_03_30_008939.JPG) as the key and the fc-7 features as the values. We assume they are simply NumPy arrays, but this is what still needs to be checked and investigated.
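Our pseudocode for generating the missing files might look like the following sketch (hypothetical helper names: build_mc_model and load_ethec_split are our placeholders, and model.forward and best_weights.pty follow the description above; none of this is verified against the repository):

```python
# Pseudocode sketch: extract fc-7 features with the best CNN (MC) model
# and save one dictionary per split as a .npy file.
import numpy as np
import torch

model = build_mc_model()                                  # hypothetical constructor
model.load_state_dict(torch.load("best_weights.pty"))     # best MC weights
model.eval()

for split in ("train", "val", "test"):
    features = {}
    for image_name, image in load_ethec_split(split):     # 224 x 224 RGB tensors
        with torch.no_grad():
            fc7 = model.forward(image.unsqueeze(0))       # fc-7 features
        features[image_name] = fc7.squeeze(0).numpy()
    np.save(f"{split}.npy", features)                     # dict of NumPy arrays
```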

To summarise, the paper injects differential geometry into the domain of image classification with neural networks. One of the difficulties we came across was the organisation of the code in the paper's GitHub repository, which was quite messy and difficult to understand, as the authors themselves also acknowledged during our correspondence. We would have liked to reproduce the results of the Hyperbolic Cones (HC) classifier, but due to missing files and time restrictions we were not able to. However, the best model among all, according to the authors Dhall et al. [2020], was the Marginalization Classifier (MC), so was this massive and complicated extension even necessary?

At first glance, it may seem that we, and Dhall et al. [2020], wasted our time on this extensive treatment of hyperbolic cones and hyperbolic geometry, given that the results for the MC classifier are better than those of the HC classifier. However, this is the first paper addressing such an approach. Theoretically speaking, the volume of a ball in hyperbolic geometry grows exponentially with the radius, compared to polynomially in Euclidean geometry, so exponentially-growing hierarchies can be embedded in a low-dimensional space. Moreover, hyperbolic space models tree-like structures better, and a tree is exactly the data structure used to define the label hierarchy. Therefore, despite the mediocre results, the concept is valid and could be utilised in many applications to potentially improve classification, and the idea deserves to be exploited further. For example, it could outperform the state of the art when modelling the correlations between objects rather than merely classifying them. The main takeaway of Dhall et al. [2020] is that embedding hierarchical semantics into classifiers improves their performance. Using hyperbolic geometry to further aid this performance boost is something that should still be actively explored.

References

N. A’Campo and A. Papadopoulos. Notes on hyperbolic geometry. Strasbourg Master Class on Geometry, 18:1–182, 2012. doi: 10.4171/105.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

Ankit Dhall, Anastasia Makarova, Octavian Ganea, Dario Pavllo, Michael Greeff, and Andreas Krause. Hierarchical image classification using entailment cone embeddings. CoRR, abs/2004.03459, 2020. URL https://arxiv.org/abs/2004.03459.

O. Ganea, G. Bécigneul, and T. Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.

Edward Raff. A step toward quantifying independently reproducible machine learning research, 2019.

Gunther Schmidt. Relational Mathematics. Cambridge University Press, USA, 2010. ISBN 0521762685.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. arXiv preprint arXiv:1511.06361, 2015.
