Charting large materials dataspaces: AI methods and scalability

Course/Event Essentials

Event/Course Start
Event/Course End
Event/Course Format
In person

Venue Information

Country: France

Training Content and Scope

Level of Instruction
Beginner
Intermediate
Advanced
Sector of the Target Audience
Research and Academia
Industry
HPC Profile of Target Audience
Application Users
Application Developers
Data Scientists
Language of Instruction

Other Information

Supporting Project(s)
NOMAD
Event/Course Description

Organisers

  • Luca Ghiringhelli (NOMAD Laboratory at the Fritz Haber Institute of the Max Planck Society and Humboldt University, Berlin)
  • James Kermode (University of Warwick)
  • Markus Rampp (Max Planck Computing and Data Facility (MPCDF))

Across a wide range of fields and in particular in materials science, there is increasing awareness that big data is a fundamental resource for fostering deeper understanding of physical systems and ensuring reproducibility of calculations.

It is crucial to realize that “big” does not refer only to the sheer amount of data, but also to their complexity. For example, in materials science, a material is typically characterized by an intricate hierarchy of observables, including ensemble averages at various thermodynamic conditions. Another crucial aspect is validation and uncertainty quantification, i.e., the ability to assign a level of accuracy to every single entry in the database, so that data points from disparate sources can be used together in one analysis.

Such awareness has motivated the creation of large computational materials-science databases. Some are “project-based”, i.e., collections of high-throughput scans of given materials classes (e.g., AFLOW [1], Materials Project [2], OQMD [3]), while others collect data from heterogeneous sources (e.g., NOMAD [4], Materials Cloud [5]).

For the data to be (re-)usable for new analyses and possibly discoveries, they have to comply with the so-called FAIR (findable, accessible, interoperable, reusable/repurposable/recyclable) principles [6].

This requires complex, hierarchical metadata structures that annotate the data, so that the users know the provenance (settings, purpose) of a calculation in order to judge whether an entry can be part of a dataset to be analysed [7].
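As a minimal sketch of how hierarchical metadata can support this judgement, the Python snippet below stores provenance as a nested dictionary and checks an entry against a set of requirements. All key names, values, and helper functions are hypothetical and chosen for illustration only; they do not reproduce the schema of any database mentioned here.

```python
# Hypothetical metadata record; key names are illustrative, not a real schema.
entry = {
    "material": {"formula": "TiO2"},
    "provenance": {
        "code": {"name": "example-dft-code", "version": "1.0"},
        "settings": {"xc_functional": "PBE", "k_grid": (8, 8, 8)},
        "purpose": "geometry-optimization",
    },
}


def get_path(record, path, default=None):
    """Walk a dotted path such as 'provenance.settings.xc_functional'."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def matches(record, requirements):
    """Return True if every required metadata path has the required value."""
    return all(get_path(record, path) == value
               for path, value in requirements.items())


# Only accept entries computed with the same (assumed) functional.
requirements = {"provenance.settings.xc_functional": "PBE"}
usable = matches(entry, requirements)
```

Addressing metadata by path keeps the selection criteria separate from the records themselves, so the same requirements can be applied to entries from different sources.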

The complexity and extent of the existing databases, which can only grow in both respects, reveal a rarely addressed challenge: how to explore the databases themselves efficiently in order to reveal patterns and trends.

Here, exploration refers specifically to the possibility of producing dynamic, visual maps of the databases’ content. For instance, a user may be looking for ternary materials that do not contain radioactive species, and would like to understand how diverse the entries are, i.e., whether they span the materials space more or less uniformly or are clustered into classes, in which case understanding what the members of a class have in common is a challenge in itself.
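The query in this example can be sketched in a few lines of Python. The entries, their two-dimensional descriptors, and the list of radioactive elements below are toy assumptions made for illustration (the element list is not exhaustive), and the diversity measure is just one simple choice among many.

```python
from itertools import combinations
from math import dist

# Illustrative, non-exhaustive set of elements without stable isotopes.
RADIOACTIVE = {"Tc", "Pm", "Po", "At", "Rn", "Fr", "Ra",
               "Ac", "Th", "Pa", "U", "Np", "Pu"}


def is_ternary_non_radioactive(entry):
    """Keep entries with exactly three distinct, non-radioactive elements."""
    elements = set(entry["elements"])
    return len(elements) == 3 and not (elements & RADIOACTIVE)


def diversity(entries):
    """Return (mean, minimum) pairwise Euclidean distance between descriptors."""
    distances = [dist(a["descriptor"], b["descriptor"])
                 for a, b in combinations(entries, 2)]
    return sum(distances) / len(distances), min(distances)


# Hypothetical toy entries; descriptors are made-up 2D coordinates.
entries = [
    {"id": "a", "elements": ["Ba", "Ti", "O"], "descriptor": (0.0, 0.0)},
    {"id": "b", "elements": ["Sr", "Ti", "O"], "descriptor": (0.1, 0.0)},
    {"id": "c", "elements": ["Cu", "In", "Se"], "descriptor": (3.0, 4.0)},
    {"id": "d", "elements": ["U", "O", "F"], "descriptor": (1.0, 1.0)},  # radioactive
    {"id": "e", "elements": ["Na", "Cl"], "descriptor": (2.0, 2.0)},     # binary
]

selected = [e for e in entries if is_ternary_non_radioactive(e)]
mean_d, min_d = diversity(selected)
```

A minimum distance much smaller than the mean suggests that some entries sit close together while others lie far away, i.e., the selection is clustered rather than uniformly spread.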

This and similar kinds of questions call for interactive, dynamic, and intelligent (i.e., artificial-intelligence-driven) tools that are also efficient, i.e., able to propose a meaningful solution within seconds.

In summary, in order to harvest the still untapped richness contained in present and future materials-science data, four pillars need to be developed concurrently:

  • FAIR-compliant materials databases
  • Identification of proper descriptors and metrics for capturing the similarity amongst materials, including the complex restructuring occurring at varying environmental conditions [8]
  • Artificial-intelligence (AI) approaches for exploratory analysis: clustering, dimension reduction and corresponding visualization that can reveal hidden patterns [9]
  • Scalable implementations, combining a careful choice of hardware with algorithmic speed-ups (e.g., landmarking) [10]
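To illustrate the landmarking idea named in the last pillar, the sketch below picks a few landmark points by greedy farthest-point sampling and then labels every point by its closest landmark. Farthest-point sampling is only one common way of choosing landmarks and is an assumption here, as are the toy two-dimensional points standing in for materials descriptors.

```python
from math import dist


def farthest_point_landmarks(points, k, start=0):
    """Greedily pick k landmark indices, each far from those already chosen."""
    landmarks = [start]
    # Distance from every point to its nearest landmark chosen so far.
    nearest = [dist(p, points[start]) for p in points]
    while len(landmarks) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        landmarks.append(nxt)
        nearest = [min(d, dist(p, points[nxt]))
                   for d, p in zip(nearest, points)]
    return landmarks


def assign_to_landmarks(points, landmarks):
    """Label each point with the position of its closest landmark."""
    return [min(range(len(landmarks)),
                key=lambda j: dist(p, points[landmarks[j]]))
            for p in points]


# Toy descriptors forming three well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
          (5.0, 5.0), (5.1, 4.9),
          (10.0, 0.0), (9.8, 0.1)]

landmarks = farthest_point_landmarks(points, k=3)
labels = assign_to_landmarks(points, landmarks)
```

With N points and k landmarks, both steps need on the order of N·k distance evaluations instead of the N·N required by a full pairwise comparison, which is where the speed-up comes from when k is much smaller than N.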

In this workshop, experts in all these aspects, not necessarily limited to materials-science applications, will interact to compare ideas and solutions for performing flexible, interactive, efficient, and insightful analyses of materials databases.

References

[1] S. Curtarolo, W. Setyawan, G. Hart, M. Jahnatek, R. Chepulskii, R. Taylor, S. Wang, J. Xue, K. Yang, O. Levy, M. Mehl, H. Stokes, D. Demchenko, D. Morgan, Computational Materials Science, 58, 218-226 (2012)
[2] A. Jain, S. Ong, G. Hautier, W. Chen, W. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, K. Persson, APL Materials, 1, 011002 (2013)
[3] J. Saal, S. Kirklin, M. Aykol, B. Meredig, C. Wolverton, JOM, 65, 1501-1509 (2013)
[4] C. Draxl, M. Scheffler, MRS Bull., 43, 676-682 (2018)
[5] L. Talirz, S. Kumbhar, E. Passaro, A. Yakutovich, V. Granata, F. Gargiulo, M. Borelli, M. Uhrin, S. Huber, S. Zoupanos, C. Adorf, C. Andersen, O. Schütt, C. Pignedoli, D. Passerone, J. VandeVondele, T. Schulthess, B. Smit, G. Pizzi, N. Marzari, Sci. Data, 7, 299 (2020)
[6] M. Wilkinson, M. Dumontier, I. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J. Boiten, L. da Silva Santos, P. Bourne, J. Bouwman, A. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. Evelo, R. Finkers, A. Gonzalez-Beltran, A. Gray, P. Groth, C. Goble, J. Grethe, J. Heringa, P. ’t Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. Lusher, M. Martone, A. Mons, A. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, B. Mons, Sci. Data, 3, 160018 (2016)
[7] L. Ghiringhelli, C. Carbogno, S. Levchenko, F. Mohamed, G. Huhs, M. Lüders, M. Oliveira, M. Scheffler, npj Comput. Mater., 3, 46 (2017)
[8] A. Bartók, S. De, C. Poelking, N. Bernstein, J. Kermode, G. Csányi, M. Ceriotti, Sci. Adv., 3, e1701816 (2017)
[9] M. Ceriotti, J. Chem. Phys., 150, 150901 (2019)
[10] S. Idreos, O. Papaemmanouil, S. Chaudhuri, Overview of Data Exploration Techniques, 2015