2024

J. Appl. Comput. Topol., DOI: 10.1007/s41468-024-00182-9 (2024)

The magnitude of a finite metric space has recently emerged as a novel invariant quantity that measures the effective size of a metric space. Despite encouraging first results demonstrating the descriptive abilities of the magnitude, such as its ability to detect the boundary of a metric space, its potential use cases remain under-explored. In this work, we investigate the properties of the magnitude on images, an important data modality in many machine learning applications. By endowing each individual image with its own metric space, we define the concept of magnitude on images and analyse the individual contribution of each pixel via the magnitude vector. In particular, we show theoretically that the previously known boundary-detection properties translate to edge-detection abilities in images. Furthermore, we demonstrate practical use cases of magnitude for machine learning applications and propose a novel magnitude model that consists of a computationally efficient magnitude computation and a learnable metric. In doing so, we address a computational hurdle that previously made magnitude impractical for many applications, opening the way for the adoption of magnitude in machine learning research.
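
The underlying computation is compact: with the similarity matrix Z_ij = exp(-d(x_i, x_j)), the magnitude vector solves Z w = 1, and the magnitude is the sum of its entries. A minimal NumPy sketch, assuming Euclidean distances (the function names are ours, not the paper's):

```python
import numpy as np

def magnitude_vector(points):
    """Solve Z w = 1, where Z_ij = exp(-d(x_i, x_j))."""
    diff = points[:, None, :] - points[None, :, :]
    D = np.linalg.norm(diff, axis=-1)        # pairwise Euclidean distances
    Z = np.exp(-D)                           # similarity matrix
    return np.linalg.solve(Z, np.ones(len(points)))

def magnitude(points):
    # The magnitude is the sum of the magnitude vector's entries.
    return magnitude_vector(points).sum()
```

For a single point the magnitude is 1; for two points at distance d it equals 2/(1 + e^{-d}), approaching 2 (an "effective number of points") as the points move apart.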

Scientific Article

Nat. Commun. 15:5034 (2024)

Functionally relevant coronary artery disease (fCAD) can result in premature death or nonfatal acute myocardial infarction. Its early detection is a fundamentally important task in medicine. Classical detection approaches suffer from limited diagnostic accuracy or expose patients to possibly harmful radiation. Here we show how machine learning (ML) can outperform cardiologists in predicting the presence of stress-induced fCAD in terms of area under the receiver operating characteristic curve (AUROC: 0.71 vs. 0.64, p = 4.0E-13). We present two ML approaches: the first uses eight static clinical variables, whereas the second leverages electrocardiogram signals from exercise stress testing. At a target post-test probability for fCAD of <15%, ML facilitates a potential reduction of imaging procedures by 15-17% compared to the cardiologist's judgement. Predictive performance is validated on an internal temporal data split as well as externally. We also show that combining clinical judgement with conventional ML and deep learning using logistic regression results in a mean AUROC of 0.74.

Scientific Article

Lab. Invest. 104:102060 (2024)

Precision medicine aims to provide personalized care based on individual patient characteristics, rather than guideline-directed therapies for groups of diseases or patient demographics. Images, both radiology- and pathology-derived, are a major source of information on the presence, type, and status of disease. Exploring the mathematical relationship of pixels in medical imaging ("radiomics") and cellular-scale structures in digital pathology slides ("pathomics") offers powerful tools for extracting both qualitative and, increasingly, quantitative data. These analytical approaches, however, may be significantly enhanced by applying additional methods arising from fields of mathematics such as differential geometry and algebraic topology that remain underexplored in this context. Geometry's strength lies in its ability to provide precise local measurements, such as curvature, that can be crucial for identifying abnormalities at multiple spatial levels. These measurements can augment the quantitative features extracted in conventional radiomics, leading to more nuanced diagnostics. By contrast, topology serves as a robust shape descriptor, capturing essential features such as connected components and holes. The field of topological data analysis was initially founded to explore the shape of data, with functional network connectivity in the brain being a prominent example. Increasingly, its tools are now being used to explore organizational patterns of physical structures in medical images and digitized pathology slides. By leveraging tools from both differential geometry and algebraic topology, researchers and clinicians may be able to obtain a more comprehensive, multi-layered understanding of medical images and contribute to precision medicine's armamentarium.

Review

Digit. Scholarsh. Humanit. 39, 74-96 (2024)

We introduce HYPERBARD, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis, highlighting the advantages and drawbacks of specific representations. Leveraging the data released in HYPERBARD, we demonstrate that many solutions to popular graph mining problems are highly dependent on the representation choice, thus calling current graph curation practices into question. As an homage to our data source, and asserting that science can also be art, we present our points in the form of a play.

Scientific Article

2023

In: 37th Conference on Neural Information Processing Systems (NeurIPS), 10-16 December 2023, New Orleans, LA. Neural Information Processing Systems, 2023.

Graph generative model evaluation necessitates understanding differences between graphs on the distributional level. This entails being able to harness salient attributes of graphs in an efficient manner. Curvature constitutes one such property that has recently proved its utility in characterising graphs. Its expressive properties, stability, and practical utility in model evaluation remain largely unexplored, however. We combine graph curvature descriptors with emerging methods from topological data analysis to obtain robust, expressive descriptors for evaluating graph generative models.

J. Mach. Learn. Res. 24:333 (2023)

In recent years, algorithms and neural architectures based on the Weisfeiler-Leman algorithm, a well-known heuristic for the graph isomorphism problem, have emerged as a powerful tool for machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm's use in a machine-learning setting, focusing on the supervised regime. We discuss the theoretical background, show how to use it for supervised graph and node representation learning, discuss recent extensions, and outline the algorithm's connection to (permutation-)equivariant neural architectures. Moreover, we give an overview of current applications and future directions to stimulate further research.
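
The 1-dimensional Weisfeiler-Leman algorithm itself is a short colour-refinement loop: each node's colour is repeatedly replaced by a compressed label of its own colour together with the multiset of its neighbours' colours. A minimal sketch, assuming a dictionary-of-lists adjacency (helper names are ours):

```python
def wl_refine(adj, rounds=3):
    """1-WL colour refinement on an adjacency dict {node: [neighbours]}."""
    colors = {v: 0 for v in adj}  # start from a uniform colouring
    for _ in range(rounds):
        # New colour = (own colour, sorted multiset of neighbour colours),
        # compressed back to small integers for the next round.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return colors
```

Two graphs whose final colour histograms differ are certifiably non-isomorphic; identical histograms are inconclusive, which is exactly the expressivity ceiling inherited by message-passing GNNs.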

Review

SIAM J. Math. Data Sci. 5, 346-372 (2023)

Diffusion condensation is a dynamic process that yields a sequence of multiscale data representations that aim to encode meaningful abstractions. It has proven effective for manifold learning, denoising, clustering, and visualization of high-dimensional data. Diffusion condensation is constructed as a time-inhomogeneous process where each step first computes a diffusion operator and then applies it to the data. We theoretically analyze the convergence and evolution of this process from geometric, spectral, and topological perspectives. From a geometric perspective, we obtain convergence bounds based on the smallest transition probability and the radius of the data, whereas from a spectral perspective, our bounds are based on the eigenspectrum of the diffusion kernel. Our spectral results are of particular interest since most of the literature on data diffusion is focused on homogeneous processes. From a topological perspective, we show that diffusion condensation generalizes centroid-based hierarchical clustering. We use this perspective to obtain a bound based on the number of data points, independent of their location. To understand the evolution of the data geometry beyond convergence, we use topological data analysis. We show that the condensation process itself defines an intrinsic condensation homology. We use this intrinsic topology, as well as the ambient persistent homology, of the condensation process to study how the data changes over diffusion time. We demonstrate both types of topological information in well-understood toy examples. Our work gives theoretical insight into the convergence of diffusion condensation and shows that it provides a link between topological and geometric data analysis.
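
Each condensation step computes a diffusion operator from the current point positions and then applies it to the points themselves. A minimal sketch with a Gaussian kernel and a fixed bandwidth (our simplification; the paper's construction is more general):

```python
import numpy as np

def diffusion_condensation(X, eps=1.0, steps=10):
    """Repeatedly build a diffusion operator and apply it to the data."""
    X = np.asarray(X, dtype=float)
    traj = [X.copy()]
    for _ in range(steps):
        D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-D2 / eps)                 # Gaussian affinity kernel
        P = K / K.sum(axis=1, keepdims=True)  # row-stochastic diffusion operator
        X = P @ X                             # move points toward local averages
        traj.append(X.copy())
    return traj
```

Nearby points contract onto a common centroid while well-separated groups stay apart, which is the sense in which the process generalizes centroid-based hierarchical clustering.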

Scientific Article

In: Proceedings of Machine Learning Research, vol. 221. 2023. 1-2.

In: Proceedings of Machine Learning Research, vol. 221. 2023. 242-253.

Deep learning models have seen significant successes in numerous applications, but their inner workings remain elusive. The purpose of this work is to quantify the learning process of deep neural networks through the lens of a novel topological invariant called magnitude. Magnitude is an isometry invariant; its properties are an active area of research as it encodes many known invariants of a metric space. We use magnitude to study the internal representations of neural networks and propose a new method for determining their generalisation capabilities. Moreover, we theoretically connect magnitude dimension and the generalisation error, and demonstrate experimentally that the proposed framework can be a good indicator of the latter.

Not. Amer. Math. Soc. 70, 1640-1644 (2023)

Scientific Article

In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 26-30 June 2023, Santa Fe, USA. 2023. 319-320.

With millions of research articles published yearly, the peer review process is in danger of collapsing, especially in 'hot' areas with popular conferences. Challenges arise from the large number of manuscripts submitted, the skyrocketing use of preprint archives and institutional repositories, problems regarding the identification and availability of experts, conflicts of interest, and bias in reviewing. Such issues can affect the integrity of the reviewing process as well as the timeliness, quality, credibility, and reproducibility of research articles. Several solutions and systems have been suggested, but none work well, and neither authors nor editors are happy with how long it takes to complete the review of submitted research. This panel addresses these challenges and potential solutions, including digital libraries that recommend reviewers, as well as broader opportunities for identifying peer reviewers for scholarly journals by engaging doctoral students, postdocs, and those who recently completed their Ph.D.

In: Proceedings of Machine Learning Research, vol. 202. 2023. 35175-35197.

The manifold hypothesis, which assumes that data lies on or close to an unknown manifold of low intrinsic dimension, is a staple of modern machine learning research. However, recent work has shown that real-world data exhibits distinct non-manifold structures, i.e. singularities, that can lead to erroneous findings. Detecting such singularities is therefore crucial as a precursor to interpolation and inference tasks. We address this issue by developing a topological framework that (i) quantifies the local intrinsic dimension, and (ii) yields a Euclidicity score for assessing the 'manifoldness' of a point along multiple scales. Our approach identifies singularities of complex spaces, while also capturing singular structures and local geometric complexity in image data.
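
The paper's framework is topological and multi-scale; as a simpler point of reference, the notion of local intrinsic dimension it quantifies can be illustrated with PCA on a point's neighbourhood (this sketch illustrates the general idea, not the paper's method; names are ours):

```python
import numpy as np

def local_pca_dim(X, idx, k=10, var_threshold=0.95):
    """Number of directions explaining `var_threshold` of local variance."""
    dist = np.linalg.norm(X - X[idx], axis=1)
    nbrs = X[np.argsort(dist)[:k]]           # k nearest neighbours of point idx
    centered = nbrs - nbrs.mean(axis=0)
    ev = np.linalg.svd(centered, compute_uv=False) ** 2  # local PCA spectrum
    ratios = np.cumsum(ev) / ev.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)
```

At a singularity (e.g. where two sheets intersect), such local estimates disagree across scales, which is the kind of deviation from "manifoldness" the Euclidicity score is designed to capture.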

In: Proceedings - International Symposium on Biomedical Imaging, 18-21 April 2023, Cartagena, Colombia. IEEE, 2023.

Diffusion models are a special type of generative model, capable of synthesising new data from a learnt distribution. We introduce DISPR, a diffusion-based model for solving the inverse problem of three-dimensional (3D) cell shape prediction from two-dimensional (2D) single cell microscopy images. Using the 2D microscopy image as a prior, DISPR is conditioned to predict realistic 3D shape reconstructions. To showcase the applicability of DISPR as a data augmentation tool in a feature-based single cell classification task, we extract morphological features from the red blood cells grouped into six highly imbalanced classes. Adding features from the DISPR predictions to the three minority classes improved the macro F1 score from F1macro = 55.2 ± 4.6% to F1macro = 72.2 ± 4.9%. We thus demonstrate that diffusion models can be successfully applied to inverse biomedical problems, and that they learn to reconstruct 3D shapes with realistic morphological features from 2D microscopy images.

In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2023. 571-579.

The computer vision task of reconstructing 3D images, i.e., shapes, from their single 2D image slices is extremely challenging, more so in the regime of limited data. Deep learning models typically optimize geometric loss functions, which may lead to poor reconstructions as they ignore the structural properties of the shape. To tackle this, we propose a novel topological loss function based on the Euler Characteristic Transform. This loss can be used as an inductive bias to aid the optimization of any neural network toward better reconstructions in the regime of limited data. We show the effectiveness of the proposed loss function by incorporating it into SHAPR, a state-of-the-art shape reconstruction model, and test it on two benchmark datasets, viz., the Red Blood Cells and Nuclei datasets. We also show a favourable property, namely injectivity, and discuss the stability of the topological loss function based on the Euler Characteristic Transform.
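
The Euler Characteristic Transform scans a shape along directions and records Euler characteristics of sublevel sets; its basic ingredient, the Euler characteristic of a cubical complex, can be sketched for a 2D binary image as vertices minus edges plus faces (helper name is ours):

```python
import numpy as np

def euler_characteristic(img):
    """V - E + F for the cubical complex of a 2-D binary image."""
    verts, edges, faces = set(), set(), set()
    for i, j in zip(*np.nonzero(img)):
        faces.add((i, j))                    # each foreground pixel: filled square
        corners = [(i, j), (i + 1, j), (i, j + 1), (i + 1, j + 1)]
        verts.update(corners)
        edges.update([
            ((i, j), (i + 1, j)), ((i, j), (i, j + 1)),
            ((i + 1, j), (i + 1, j + 1)), ((i, j + 1), (i + 1, j + 1)),
        ])
    return len(verts) - len(edges) + len(faces)
```

For a solid patch this gives 1 (one component, no hole); punching a hole lowers it to 0, which is why the quantity is sensitive to exactly the structural properties that purely geometric losses ignore.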

Entropy 25:691 (2023)

We study selection bias in meta-analyses by assuming the presence of researchers (meta-analysts) who intentionally or unintentionally cherry-pick a subset of studies by defining arbitrary inclusion and/or exclusion criteria that will lead to their desired results. When the number of studies is sufficiently large, we theoretically show that a meta-analyst might falsely obtain (non)significant overall treatment effects, regardless of the actual effectiveness of a treatment. We corroborate all theoretical findings with extensive simulation experiments and practical clinical examples. Numerical evaluations demonstrate that the standard method for meta-analyses has the potential to be cherry-picked.
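
The standard fixed-effect (inverse-variance) pooling under scrutiny can be sketched in a few lines, together with a toy illustration of cherry-picking (the numbers are illustrative, not from the paper):

```python
import numpy as np

def fixed_effect_meta(effects, variances):
    """Inverse-variance weighted pooled effect and its standard error."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    estimate = np.sum(w * effects) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    return estimate, se
```

Pooling four balanced studies with effects (0.5, -0.5, 0.4, -0.4) and equal variances gives a null overall effect, whereas keeping only the two positive studies yields z = 0.45/sqrt(0.02) > 1.96, i.e. an apparently "significant" treatment benefit from the same evidence base.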

Scientific Article

2022

In: Proceedings of Machine Learning Research, vol. 198. 2022. Accepted.

Graph Neural Networks (GNNs) extend the success of neural networks to graph-structured data by accounting for their intrinsic geometry. While extensive research has been done on developing GNN models with superior performance according to a collection of graph representation learning benchmarks, it is currently not well understood what aspects of a given model are probed by them. For example, to what extent do they test the ability of a model to leverage graph structure vs. node features? Here, we develop a principled approach to taxonomize benchmarking datasets according to a sensitivity profile that is based on how much GNN performance changes due to a collection of graph perturbations. Our data-driven analysis provides a deeper understanding of which benchmarking data characteristics are leveraged by GNNs. Consequently, our taxonomy can aid in the selection and development of adequate graph benchmarks and in the better-informed evaluation of future GNN methods. Finally, our approach and its implementation in the GTaxoGym package are extensible to multiple graph prediction task types and future datasets.

In: Proceedings of Machine Learning Research, vol. 196. 2022. 142-151.

Graph embedding techniques are a staple of modern graph learning research. When using embeddings for downstream tasks such as classification, information about their stability and robustness, i.e., their susceptibility to sources of noise, stochastic effects, or specific parameter choices, becomes increasingly important. As one of the most prominent graph embedding schemes, we focus on node2vec and analyse its embedding quality from multiple perspectives. Our findings indicate that embedding quality is unstable with respect to parameter choices, and we propose strategies to remedy this in practice.
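
One simple way to probe embedding stability of the kind studied here is to compare k-nearest-neighbour structure across two embedding runs (e.g. different random seeds or parameter settings); the metric below is an illustration of that idea, not necessarily the paper's choice:

```python
import numpy as np

def knn_overlap(emb_a, emb_b, k=5):
    """Mean fraction of shared k-nearest neighbours across two embeddings."""
    def knn(E):
        D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)          # exclude self-matches
        return np.argsort(D, axis=1)[:, :k]
    A, B = knn(emb_a), knn(emb_b)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(A, B)]))
```

A score of 1.0 means both runs agree on every node's local neighbourhood; scores well below 1.0 across seeds are the kind of instability the paper reports for node2vec.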

In: Proceedings of Machine Learning Research, vol. 196. 2022. 1-5.

In: Advances in Neural Information Processing Systems, vol. 35. 2022.

We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class - in our case, empirical Rademacher complexity - to what extent can we (a priori) constrain this class while retaining an empirical error on a par with the unconstrained regime? To assess excess capacity in modern architectures (such as residual networks), we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group norm distance to the initializations of the convolution weights. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms, complementary to classic compression via weight pruning. Source code is available at https://github.com/rkwitt/excess_capacity.

In: Advances in Neural Information Processing Systems, vol. 35. 2022.

We introduce a new intrinsic measure of local curvature on point-cloud data called diffusion curvature. Our measure uses the framework of diffusion maps, including the data diffusion operator, to structure point cloud data and define local curvature based on the laziness of a random walk starting at a point or region of the data. We show that this laziness directly relates to volume comparison results from Riemannian geometry. We then extend this scalar curvature notion to an entire quadratic form using neural network estimations based on the diffusion map of point-cloud data. We show applications of both estimations on toy data, single-cell data and on estimating local Hessian matrices of neural network loss landscapes.
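
The "laziness" at the heart of diffusion curvature is the probability that a t-step random walk under the data diffusion operator returns to its starting point. A minimal sketch with a fixed Gaussian bandwidth (a simplification of the paper's diffusion-maps construction; names are ours):

```python
import numpy as np

def diffusion_laziness(X, eps=0.5, t=4):
    """Return probabilities of a t-step walk under the diffusion operator."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-D2 / eps)                    # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)     # row-stochastic random walk
    Pt = np.linalg.matrix_power(P, t)
    return np.diag(Pt)                       # "laziness" per starting point
```

Intuitively, walks started in regions with less room to spread out return more often, which is what relates laziness to volume comparison results from Riemannian geometry.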

Lect. Notes Comput. Sc. 13434 LNCS, 150-159 (2022)

Reconstructing 3D objects from 2D images is challenging both for our brains and for machine learning algorithms. To support this spatial reasoning task, contextual information about the overall shape of an object is critical. However, such information is not captured by established loss terms (e.g. Dice loss). We propose to complement geometrical shape information by including multi-scale topological features, such as connected components, cycles, and voids, in the reconstruction loss. Our method uses cubical complexes to calculate topological features of 3D volume data and employs an optimal transport distance to guide the reconstruction process. This topology-aware loss is fully differentiable, computationally efficient, and can be added to any neural network. We demonstrate the utility of our loss by incorporating it into SHAPR, a model for predicting the 3D cell shape of individual cells based on 2D microscopy images. Using a hybrid loss that leverages both geometrical and topological information of single objects to assess their shape, we find that topological information substantially improves the quality of reconstructions, thus highlighting its ability to extract more relevant features from image datasets.

Scientific Article

Lect. Notes Comput. Sc. 13205 LNCS, 171-184 (2022)

Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel “jump and retrain” procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network’s ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.

Scientific Article

In: (International Conference on Learning Representations, 25–29 April 2022, Virtual). 2022.

Graph neural networks (GNNs) are a powerful architecture for tackling graph learning tasks, yet have been shown to be oblivious to eminent substructures such as cycles. We present TOGL, a novel layer that incorporates global topological information of a graph using persistent homology. TOGL can be easily integrated into any type of GNN and is strictly more expressive (in terms of the Weisfeiler–Lehman graph isomorphism test) than message-passing GNNs. Augmenting GNNs with TOGL leads to improved predictive performance for graph and node classification tasks, both on synthetic data sets, which can be classified by humans using their topology but not by ordinary GNNs, and on real-world data.
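
TOGL itself is a neural network layer, but the 0-dimensional persistent homology it builds on reduces to a union-find pass over an edge filtration: every component is born at weight 0 and dies when an incoming edge merges it into another. A sketch of that computation (not the paper's implementation; names are ours):

```python
def zeroth_persistence(n, weighted_edges):
    """0-dim persistence pairs of a graph filtration via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    pairs = []
    for w, u, v in sorted(weighted_edges):   # edges enter by increasing weight
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv                  # one component dies at weight w
            pairs.append((0.0, float(w)))
    return pairs                             # surviving components never die
```

In a learnable layer, the edge weights come from a trained filtration function, so the resulting persistence pairs carry gradients back into the GNN.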

In: (International Conference on Learning Representations, 25–29 April 2022, Virtual). 2022.

Graph generative models are a highly active branch of machine learning. Given the steady development of new models of ever-increasing complexity, it is necessary to provide a principled way to evaluate and compare them. In this paper, we enumerate the desirable criteria for such a comparison metric and provide an overview of the status quo of graph generative model comparison in use today, which predominantly relies on the maximum mean discrepancy (MMD). We perform a systematic evaluation of MMD in the context of graph generative model comparison, highlighting some of the challenges and pitfalls researchers may inadvertently encounter. After conducting a thorough analysis of the behaviour of MMD on synthetically generated perturbed graphs as well as on recently proposed graph generative models, we provide a suitable procedure to mitigate these challenges and pitfalls. We aggregate our findings into a list of practical recommendations for researchers to use when evaluating graph generative models.
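
The MMD under scrutiny compares two samples of graph descriptor vectors through kernel means; a biased estimator with an RBF kernel is only a few lines (the bandwidth `sigma` is exactly the kind of unexamined choice the paper warns about):

```python
import numpy as np

def mmd_squared(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y (RBF kernel)."""
    def gram(A, B):
        D2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-D2 / (2.0 * sigma ** 2))
    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()
```

Because the value depends on both the descriptor function and the kernel parameters, two models can swap rank under different (equally defensible) settings, which is why a fixed evaluation protocol matters.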