While feature vectors and pixel grids serve as the standard representations of data in machine learning, many objects encountered in scientific machine learning can only be suitably modeled as discrete objects with non-sequential structure and categorical attributes. For example, a molecule is naturally described as a collection of atoms and bonds, a description that fits only a graph-based data structure.

To apply machine learning techniques to unstructured datasets, the project will make use of and further develop GraphDot (https://pypi.org/project/graphdot/), a Python package that bridges graph-based databases to a wide array of kernel-based machine learning methods. The project will be expanded on multiple fronts, including ML algorithm design, software implementation, and optimization. Applications to real-world scientific problems will be carried out to predict properties of molecules and crystal structures that are of importance in energy-related and pharmaceutical contexts.

Problem

Kernels have a wide range of applications in machine learning, for instance as similarity functions in pattern analysis or as covariance functions defining stochastic processes. Currently, mainstream Python machine learning packages provide only predefined kernels or package-exclusive modules for kernel composition. These stock kernel implementations are also limited to Euclidean/vectorized data, even though kernels generalize naturally to other spaces.

Goal

My project is the tentatively named KernelBridge, a package that aims to resolve the interoperability issues and limited scope of application of today’s kernel modules. The mission statement of the package is twofold:

  1. Create and compose bespoke kernels that are interoperable with mainstream Python machine learning packages such as scikit-learn and GPflow.
  2. Extend kernels via R-convolution (Haussler, 1999) to discrete structures, unlocking many potential uses in graph learning, NLP, computer vision, and more (see the formula below).
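
For reference, Haussler’s R-convolution kernel takes the following form. Writing R^{-1}(x) for the set of valid decompositions of an object x into parts (x_1, ..., x_D), and k_d for a base kernel on the d-th part, the convolution kernel is

    k(x, y) = \sum_{(x_1, \ldots, x_D) \in R^{-1}(x)} \; \sum_{(y_1, \ldots, y_D) \in R^{-1}(y)} \; \prod_{d=1}^{D} k_d(x_d, y_d),

which Haussler showed is positive semi-definite whenever the base kernels k_d are.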

Data

Because the project is general-purpose rather than model-driven, direct data analysis was not a major focus. Most data is constructed solely for unit testing: kernels are computed on standalone floating-point numbers or small arrays of them and cross-checked against reference values, while the R-convolution kernels are tested on arrays of manually constructed gene sequences and checked for positive semi-definiteness (itself a nontrivial task). This “data” is susceptible to the usual pitfalls of software testing: failing to consider edge cases, or, conversely, focusing too closely on edge cases and failing to generalize, can lead to a codebase that does not lend itself to modularization and extensibility. Because this package is open source, those two qualities are centrally important to the project.
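
To illustrate the positive semi-definiteness check, here is a minimal sketch (not the package’s actual test code; the kernel, data, and tolerance below are placeholders): on any finite sample, the Gram matrix of a valid kernel must be symmetric with no eigenvalues below zero, up to floating-point tolerance.

    import numpy as np

    def is_psd_on_sample(kernel, samples, tol=1e-10):
        """Check that the Gram matrix of `kernel` on `samples` has no
        eigenvalue below -tol, i.e. is positive semi-definite up to
        numerical error."""
        gram = np.array([[kernel(a, b) for b in samples] for a in samples])
        eigvals = np.linalg.eigvalsh(gram)  # Gram matrix is symmetric
        return bool(eigvals.min() >= -tol)

    # Example: the RBF kernel is PSD on any finite sample of scalars.
    rbf = lambda a, b: np.exp(-(a - b) ** 2)
    assert is_psd_on_sample(rbf, [0.0, 0.5, 1.3, 2.7])

Passing such a check on a finite sample is, of course, necessary rather than sufficient, which is part of what makes this kind of testing difficult.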

Implementation

The package uses typical object-oriented programming paradigms to give users an intuitive way to define a bespoke Kernel class from an expression for the computation and a set of hyperparameters, which can then be instantiated for actual use. The package also implicitly defines a domain-specific language for the composition of kernels, making use of Python magic methods such as __str__ and __repr__. This standardized set of “rules” makes the code robust and extensible.
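
The snippet below is a minimal sketch of this pattern (the class and method names are illustrative, not KernelBridge’s actual API): a kernel is defined by an expression and hyperparameters, and operator overloading plus __repr__ yields a small composition language.

    import math

    class Kernel:
        """Minimal sketch of an expression-based kernel with hyperparameters."""

        def __init__(self, expr, name, **hyperparameters):
            self.expr = expr  # callable: (x, y, **theta) -> float
            self.name = name
            self.theta = hyperparameters

        def __call__(self, x, y):
            return self.expr(x, y, **self.theta)

        def __add__(self, other):
            # Sums and products of kernels are again kernels, so composition
            # can ride on ordinary arithmetic operators.
            return Kernel(lambda x, y, **_: self(x, y) + other(x, y),
                          name=f'({self!r} + {other!r})')

        def __mul__(self, other):
            return Kernel(lambda x, y, **_: self(x, y) * other(x, y),
                          name=f'({self!r} * {other!r})')

        def __repr__(self):
            # The printable form doubles as the composition DSL.
            return self.name

    rbf = Kernel(lambda x, y, s: math.exp(-(x - y) ** 2 / s ** 2), 'rbf', s=1.0)
    const = Kernel(lambda x, y, c: c, 'const', c=2.0)
    k = rbf * const + rbf
    print(k)            # ((rbf * const) + rbf)
    print(k(0.0, 1.0))  # 1.1036...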

Interoperability is handled elegantly through Python’s duck typing. Kernel objects are wrapped with whatever native methods and attributes the architecture of another package requires. Computation, composition, and other intrinsic functionality remain self-contained, while the relevant attributes and function calls are seamlessly exported as needed. This lets kernel objects stay lightweight and draw upon KernelBridge’s (hopefully) superior computation architecture while still achieving out-of-the-box plug-and-play compatibility with other packages.
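
As a sketch of the wrapping idea (again with hypothetical names; this is not the package’s actual shim), a kernel defined on pairs of objects can be dressed with the call signature and diag method that scikit-learn’s Gaussian process estimators expect from a kernel:

    import numpy as np

    class SklearnShim:
        """Duck-typed adapter: exposes the call signature scikit-learn's
        Gaussian process estimators use, delegating the actual pairwise
        computation to the wrapped kernel object."""

        def __init__(self, kernel):
            self.kernel = kernel

        def __call__(self, X, Y=None, eval_gradient=False):
            Y = X if Y is None else Y
            K = np.array([[self.kernel(x, y) for y in Y] for x in X])
            if eval_gradient:
                # Hyperparameter gradients are omitted in this sketch.
                return K, np.empty((len(X), len(X), 0))
            return K

        def diag(self, X):
            return np.array([self.kernel(x, x) for x in X])

        def is_stationary(self):
            return False

A production adapter would also need to expose the hyperparameter plumbing (theta, bounds, clone_with_theta) that scikit-learn’s optimizer manipulates, which is exactly the kind of package-specific detail the wrapper layer is meant to hide.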

The package also provides a module for R-convolution, which is currently in the early stages of development. The R-convolution kernel decomposes its inputs into “parts,” then sums (convolves) base kernels evaluated on pairs of these parts. The crux of R-convolution is the relation R, which rigorously defines the parts of an object, e.g. the random walks of a graph or the substrings satisfying a regular grammar. So far, the module is operational on regular grammars with simple alphabets: rather trivially and inefficiently, it enumerates all subsequences of the input pair that are valid under a supplied regular grammar and performs the convolution on pairs of these subsequences.
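
In spirit, the baseline behaves like the following sketch (simplified: the “grammar” is reduced to a membership predicate and the base kernel to a Kronecker delta; the module’s real interfaces differ):

    from itertools import combinations

    def subsequences(seq):
        """Enumerate all non-empty subsequences of a string."""
        for r in range(1, len(seq) + 1):
            for idx in combinations(range(len(seq)), r):
                yield ''.join(seq[i] for i in idx)

    def r_convolution(x, y, is_valid, base_kernel):
        """Baseline R-convolution: collect the parts of x and y that are
        valid under the relation R (here a predicate), then sum the base
        kernel over all pairs of parts. Exponential in input length."""
        parts_x = [p for p in subsequences(x) if is_valid(p)]
        parts_y = [p for p in subsequences(y) if is_valid(p)]
        return sum(base_kernel(px, py) for px in parts_x for py in parts_y)

    # Kronecker delta on strings, and a toy "grammar": runs of one base.
    delta = lambda a, b: 1.0 if a == b else 0.0
    one_base_run = lambda s: len(set(s)) == 1
    print(r_convolution('AAT', 'ATT', one_base_run, delta))  # 4.0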

In the future, the package will allow users to define the relation R with a generalized context-free grammar. This grammar will be translated into a finite-state machine, which will in turn generate valid parts under R as inputs for the convolved “feature” kernels. Sampling parts from this machine, rather than exhaustively enumerating them, will vastly improve runtime efficiency, which can be improved further through GPU parallelization with CUDA. Additionally, the part-generation procedure can be refined iteratively via importance sampling to produce increasingly representative parts.

Usage

[[{"fid":"1939","view_mode":"default","fields":{"class":"media-element file-width-400 openberkeley-theme-brand-image-focused","data-delta":"1","format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false},"type":"media","field_deltas":{"1":{"class":"media-element file-width-400 openberkeley-theme-brand-image-focused","data-delta":"1","format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false}},"attributes":{"class":"media-element file-default","data-delta":"1"}}]]

[[{"fid":"1940","view_mode":"default","fields":{"format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false},"type":"media","field_deltas":{"2":{"format":"default","field_file_image_alt_text[und][0][value]":false,"field_file_image_title_text[und][0][value]":false}},"attributes":{"class":"media-element file-default","data-delta":"2"}}]]

Impact

The creation and composition modules are fully functional as far as current unit tests can verify. Interoperability has presented some issues, particularly because GPflow’s kernel architecture expects TensorFlow tensors, possibly with only a subset of active dimensions, as input. However, this is a relatively small bug, unrelated to the general idea behind the wrapper modules. Finally, the baseline implementation of R-convolution passes smoke tests for kernel computation on arrays of gene sequences. That said, this baseline module only uses Kronecker deltas, a regular grammar, and deterministic enumeration of subsequences, rather than the ultimate goal of a finite-state machine approach. Nonetheless, the progress is very promising, and much of the groundwork for improvements has been laid.

As stated earlier, this package aims to remedy the limited usability of today’s kernel modules. Hopefully it can become, or at least be a step toward, a unified module for creating and composing kernels for use across all machine learning packages and frameworks. A large-scale, extensible kernel package, especially one that extends to discrete structures, can unlock the myriad potential uses of kernels in graph learning, active learning on molecules, NLP, computer vision, and much more.

Conclusion

The project is still underway, but I am proud of the progress made thus far. This has been an unparalleled opportunity to learn more about not just the mathematics of positive semi-definite kernels, but also the (open source) software engineering paradigms that are quickly increasing in relevance even for careers outside of tech. I am excited to continue developing KernelBridge, particularly the R-convolution module. For all the challenges it presents, both theoretically and computationally, there are so many exciting things to learn and implement. I would like to express my utmost gratitude to my Discovery Partner Yu-Hang Tang for being an incredible mentor, even beyond the scope of the project, and to all those helping to administer and provide resources for Discovery students.

Drake Wong

Term: Spring 2020
Topics: Data Visualizations; Platforms/Infrastructure