Time and Location: 2/5/2020, Wed., 4 p.m., CS Conference Room 206
Speaker: Mr. Chenxu Niu
Title: Data Discovery Service for Self-describing Datasets
Abstract: The ability to discover datasets of interest is increasingly critical for drawing insights and generating value from data. However, finding datasets relevant to their needs is often a daunting task for users, much like finding a needle in a haystack. Several existing efforts promote data discoverability, such as Google Dataset Search. These services, however, are largely keyword based and often require dataset owners to provide metadata manually and calibrate it to meet the search service’s requirements. These processes are tedious and can easily lead to low-quality metadata that degrades search results. In this research we propose, for the first time, a data discovery service for self-describing scientific datasets. The service automates data discovery by using the built-in, high-quality metadata of self-describing data file formats, such as HDF5 and netCDF. This talk will present the idea of the data discovery service, introduce a similarity model for determining similarity among self-describing datasets in order to discover datasets of interest, and discuss the current design, results, and findings.
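The abstract does not specify the similarity model, so the following is only a minimal sketch of one plausible approach, not the speaker's method. It assumes each dataset's built-in metadata has already been extracted into a flat Python dict, and it swaps in plain Jaccard similarity over "key=value" tokens. The function names and the sample catalog are hypothetical.

```python
def attribute_tokens(metadata):
    """Flatten a metadata dict into a set of 'key=value' tokens."""
    return {f"{key}={value}" for key, value in metadata.items()}


def jaccard_similarity(meta_a, meta_b):
    """Jaccard similarity of two metadata dicts (0.0 when both are empty)."""
    tokens_a = attribute_tokens(meta_a)
    tokens_b = attribute_tokens(meta_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def rank_datasets(query_meta, catalog):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, jaccard_similarity(query_meta, meta))
        for name, meta in catalog.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical metadata, standing in for attributes read from HDF5/netCDF files.
catalog = {
    "ocean_temp.nc": {"variable": "temperature", "units": "K", "domain": "ocean"},
    "air_temp.nc": {"variable": "temperature", "units": "K", "domain": "atmosphere"},
    "rainfall.h5": {"variable": "precipitation", "units": "mm", "domain": "atmosphere"},
}
query = {"variable": "temperature", "units": "K", "domain": "ocean"}

for name, score in rank_datasets(query, catalog):
    print(f"{name}: {score:.2f}")
```

A set-based measure like this ignores attribute importance and numeric closeness; a real service would likely weight attributes and compare dataset structure as well.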