Google launches new search engine to help scientists find the datasets they need

Google’s goal has always been to organize the world’s
information, and its first target was the commercial web. Now,
it wants to do the same for the scientific community with a new
search engine for datasets.

The service, called Dataset Search, launches today, and it will
be a companion of sorts to Google Scholar, the company’s popular
search engine for academic
studies and reports. Institutions that publish their data
online, like universities and governments, will need to include
metadata tags in their
webpages that describe their data, including who created it,
when it was published, how it was collected, and so on. This
information will then be indexed by Google’s search engine and
combined with information from the Knowledge Graph. (So if
dataset X was published by CERN, a little information about the
institute will also be included in the search.)
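
To make the metadata step concrete, here is a minimal sketch of what a publisher might generate for one dataset page. The article does not name the tag format, so this assumes the schema.org Dataset vocabulary written as JSON-LD. The helper names `dataset_jsonld` and `as_script_tag`, and every value in the example, are hypothetical and chosen only for illustration.

```python
import json


def dataset_jsonld(name, description, creator, date_published,
                   collection_method, url):
    """Build a schema.org Dataset description as a JSON-LD dictionary.

    Assumption: the publisher uses the schema.org "Dataset" type. The
    fields mirror what the article lists: who created the data, when it
    was published, and how it was collected.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "creator": {"@type": "Organization", "name": creator},
        "datePublished": date_published,
        "measurementTechnique": collection_method,
    }


def as_script_tag(metadata):
    """Wrap the metadata in a script tag that can be placed in a webpage."""
    body = json.dumps(metadata, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # All values below are made up for illustration.
    example = dataset_jsonld(
        name="Example ocean temperature records",
        description="Hypothetical sea surface temperature measurements.",
        creator="Example Oceanographic Institute",
        date_published="2018-01-01",
        collection_method="Moored buoy sensors",
        url="https://example.org/datasets/ocean-temperature",
    )
    print(as_script_tag(example))
```

A crawler that understands this vocabulary could read the tag from the page and index the dataset without the data itself moving anywhere, which fits the "keep it where it is" goal Noy describes later in the article.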

A search engine to unite the fragmented world of online
datasets

Speaking to The Verge, Natasha Noy, a research
scientist at Google AI who helped create Dataset Search, says
the aim is to unify the tens of thousands of different
repositories for datasets online. “We want to make that data
discoverable, but keep it where it is,” says Noy.

At the moment, dataset publication is extremely fragmented.
Different scientific domains have their own preferred
repositories, as do different governments and local
authorities. “Scientists say, ‘I know where I need to go to
find my datasets, but that’s not what I always want,’”
says Noy. “Once they step out of their unique community, that’s
when it gets hard.”

Noy gives the example of a climate scientist she spoke to
recently who told her she’d been looking for a specific dataset
on ocean temperatures for an upcoming study but couldn’t find
it anywhere. She didn’t track it down until she ran into a
colleague at a conference who recognized the dataset and told
her where it was hosted. Only then could she continue with her
work. “And this wasn’t even a particularly boutique
depository,” says Noy. “The dataset was well written up in a
fairly prominent place, but it was still difficult to find.”


An example search for weather records in Google Dataset Search.
Image: Google

The initial release of Dataset Search will cover the
environmental and social sciences, government data, and
datasets from news organizations like ProPublica.
However, if the service becomes popular, the amount of data it
indexes should quickly snowball as institutions and scientists
scramble to make their information accessible.

This should be helped by the recent flourishing of open data
initiatives around the world. “I do think in the last several
years the number of repositories has exploded,” says Noy. She
credits the increasing importance of data in scientific
literature, which means journals ask authors to publish
datasets, as well as “government regulations in the US and
Europe and the general rise of the open data movement.”

Having Google involved should help make this project a success,
says Jeni Tennison, CEO of the Open Data Institute (ODI).
“Dataset search has always been a difficult thing to support,
and I’m hopeful that Google stepping in will make it easier,”
she says.

To create a decent search engine, you need to know how to build
user-friendly systems and understand what people mean when they
type in certain phrases, says Tennison. Google obviously knows
what it’s doing in both of those departments.

In fact, says Tennison, ideally Google will publish its own
dataset on how Dataset Search gets used. Although the metadata
tags the company is using to make datasets visible to its
search crawlers are an open standard (meaning that any
competitor like Bing or Yandex can also use them and build a
competing service), search engines improve most quickly when a
critical mass of users is there to provide data on what they’re
doing.

“Simply understanding how people search is important… what
kind of terms they use, how they express them,” says Tennison.
“If we want to get to grips with how people search for data and
make it more accessible, it would be great if Google opened up
its own data on this.”

In other words: Google should publish a dataset about dataset
search that would be indexed by Dataset Search. What could be
more appropriate?
