Graph Data Science: What you need to know


Whether you’re genuinely interested in uncovering insights and solving problems using data, or you’re simply drawn to what LinkedIn calls “the most promising career” and Glassdoor calls “the best job in America,” chances are you are familiar with data science. But what about graph data science?

As we have already pointed out, graphs are a universal data structure with manifestations covering a wide spectrum: from analytics to databases and from knowledge management to data science, machine learning and even hardware.

Graph data science is when you want to answer questions not just with your data, but with the connections between your data points. That’s the 30-second explanation, according to Alicia Frame.

Frame is Senior Director of Product Management for Data Science at Neo4j, a leading provider of graph databases. She has a PhD in Computational Biology and has been a practicing data scientist working with connected data for 10 years.

When she joined Neo4j about three years ago, she set out to build a best-in-class solution for data scientists handling connected data. Today, the product Frame leads at Neo4j, aptly named Graph Data Science, celebrates its two-year anniversary with version 2.0, which brings some important innovations: new features, a native Python client and availability as a managed service under the name AuraDS on Google Cloud.

We caught up with Frame to discuss the concept of Graph Data Science and the Graph Data Science product.

The concept: Graph Data Science

The point of graph data science is to use relationships in data. Most data scientists work with data in tabular formats. However, to get better insights, to answer questions you can’t answer without using connections, or just to represent your data more faithfully, graphs are key.
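To make the contrast with tabular data concrete, here is a minimal sketch in plain Python. It assumes a hypothetical two-column table of (person, employer) rows and does not use Neo4j or GDS; the names `build_graph` and `two_hop_neighbors` are invented for illustration. It shows a question that depends on connections ("who shares an employer with this person?") being answered by walking two hops in a graph.

```python
from collections import defaultdict

# Hypothetical tabular data: one (person, employer) pair per row.
rows = [
    ("alice", "acme"),
    ("bob", "acme"),
    ("carol", "globex"),
    ("bob", "globex"),
]


def build_graph(rows):
    """Turn two-column rows into an undirected adjacency map."""
    graph = defaultdict(set)
    for left, right in rows:
        graph[left].add(right)
        graph[right].add(left)
    return graph


def two_hop_neighbors(graph, node):
    """Return the nodes reachable in exactly two steps, excluding the start."""
    result = set()
    for middle in graph.get(node, ()):
        result |= graph.get(middle, set())
    result.discard(node)
    return sorted(result)


graph = build_graph(rows)
colleagues_of_bob = two_hop_neighbors(graph, "bob")
```

In a table, the same question needs a self-join on the employer column; in the graph it is a short traversal.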

As Frame pointed out, this can mean using graph queries to find the patterns you know exist, or using unsupervised methods like graph algorithms to sift through data and surface patterns worth looking at. It can also mean using supervised machine learning to make predictions: What kind of graph is this? Or where will a relationship form in the future?

The product: Graph Data Science

The Graph Data Science (GDS) product is a relatively new addition to the Neo4j ecosystem with a dual purpose. On the one hand, it wants to appeal to data scientists as well as business analysts and data analysts who have not necessarily been users of graph databases.

The key value proposition of GDS to them is that it gives them not only the ability to store connected data in a connected form, but also a single workspace and environment where they can do everything from data analysis to querying to persistence to model training and development, Frame said. There is no ETL involved, as the data is already stored as a graph in Neo4j.

But GDS also aims to cater to Neo4j’s more traditional audience: developers. Frame referred to how Meredith Corporation used Neo4j to build its user journeys. As a follow-up to this use case, GDS was used to identify anonymous readers on Meredith’s websites.

The use case originated with a longtime Neo4j developer who liked the product. This led to an investigation of ways to get more value out of it, and eventually to using GDS to solve a problem. “They were like, wait a second, that [graph] algorithm solves this really complex application question that we have, and it just fits nicely into our pipeline,” said Frame.
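The article does not say which graph algorithm was used in this case. As an illustration only, the sketch below assumes that a connected-components style algorithm is one plausible fit for grouping anonymous identifiers, and implements it with union-find in plain Python rather than with GDS. The cookie, device and email-hash identifiers are hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Sort members and components so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical observations: two identifiers seen together in one session.
observations = [
    ("cookie:a1", "device:phone-1"),
    ("cookie:b2", "device:phone-1"),
    ("cookie:b2", "email_hash:x9"),
    ("cookie:c3", "device:laptop-7"),
]

profiles = connected_components(observations)
```

Each resulting component collects identifiers that are linked, directly or indirectly, and could be treated as one candidate reader profile.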

GDS’ data scientist-friendly interface

Ease of use of GDS for all potential users was a top priority for this release and GDS availability as a managed cloud offering is part of that. Neo4j has already made its managed cloud offering called Aura available on all major cloud platforms. After a few months of preview, GDS is now available on Google Cloud under the name AuraDS.

As Frame explained, AuraDS has been rebuilt from the ground up to offer a custom experience for data scientists. It’s based on the Aura substrate but with a different configuration, optimized for a different setup. That touches on many aspects.

From a technical perspective, data science workloads tend to be much more memory intensive and use more threads than database workloads. The team wanted to make sure they had the right configuration for data scientists to be successful, Frame said. But most of the time and effort went into building a user interface that works for data scientists, she added.

Data scientists’ needs and skills differ from those of developers: they’re interested in getting value out of their data, discovering new insights, and building better predictive models, not in setting up or maintaining a database. AuraDS has a completely rebuilt user interface that makes the user experience more user-friendly for data scientists, Frame said.

She gave the example of helping users with sizing guidelines: users provide estimates of the number of nodes and edges in the graphs they want to work with and the algorithms they want to run, and get recommendations for the resources they need. Frame also said the team added a number of metrics relevant to data scientists, such as CPU usage and memory usage.

Meet data scientists where they are

Another major improvement is the native Python client. First, it allows data scientists to work directly from Python, which is the most popular choice for them, rather than having to go through Cypher, Neo4j’s query language. Second, it lets them work with both AuraDS and GDS directly from notebooks and get results as dataframes instead of having to go through Neo4j’s UI. Users can choose what works best for them.

This illustrates a broader point about AuraDS: with its general availability, advanced features built for it are now also available in GDS. Another example of this is persistence and backup, which was driven by AuraDS but is now also available on self-managed GDS. As Frame acknowledged, working in memory is a double-edged sword. It allows for fast processing of large graphs, but it also brings some concerns.

First, if the results of processing need to be persisted, then the user needs to take care of that. Second, if there is a failure before processing is complete, the work is lost and has to start over. Frame said this wasn’t a big problem, because running graph algorithms in memory is fast and there are safeguards in place to prevent the database from falling over. However, it helps to be able to persist intermediate state.

Compatibility and synchronization

There are other operational improvements as well. GDS is now more compatible with transactional clusters. That means you don’t have to worry about copying data from your cluster to a single instance or getting data back into your cluster from that dedicated data science instance, Frame said.

That worry goes away, and you don’t end up with a setup that isn’t configured for both workloads, she added. You can now attach a dedicated GDS node to your cluster, and it automatically receives updated data in real time.

Data science workloads can run without impacting transactional workloads, and synchronization is handled internally, so you don’t have to worry about ETL. Frame highlighted this improvement, saying customers are picking this up and running it before it’s even released. Also, instances can now be paused, reducing costs without losing results.

Integrations and improvements

GDS 2.0 also brings more machine learning and AutoML capabilities. It introduces the ability to create ML pipelines for tasks such as link prediction, which fills in missing relationships in your graph, and node classification, which fills in missing labels, for example characterizing transactions as fraudulent or normal.
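To give a feel for what link prediction means, the sketch below uses a much simpler technique than a trained ML pipeline: it ranks unconnected node pairs by how many neighbors they share (the common-neighbors heuristic). It is plain Python, not the GDS API, and the small friendship graph and the name `common_neighbor_scores` are hypothetical.

```python
from itertools import combinations


def common_neighbor_scores(adjacency):
    """Score each unconnected pair by its number of shared neighbors."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already linked, nothing to predict
        shared = adjacency[u] & adjacency[v]
        if shared:
            scores[(u, v)] = len(shared)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical undirected friendship graph.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

ranked = common_neighbor_scores(adjacency)
```

Here the pair ("ann", "dan") ranks first because the two share two neighbors. A trained pipeline would typically combine many such features, including embeddings, rather than rely on a single count.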

Frame described how GDS introduces the concept of a pipeline catalog. This allows users to indicate that they want to train a model for a specific end goal, and GDS then assists them with intermediate steps such as generating embeddings and choosing the best performing model.

This also ties into a broader story: integrations, and specifically integration with Google and its Vertex AI platform. Neo4j and Google are partners and that is why AuraDS was first introduced on Google Cloud. In addition, AuraDS and Vertex AI can be integrated, and there has been and will be collaboration and evangelization between Neo4j and Google, Frame said.

New integrations are important additions to GDS/AuraDS. As Frame pointed out, data scientists don’t work in a vacuum, so helping them get data in and out of GDS is crucial. GDS 2.0 supports Neo4j connectors for Apache Spark and for BI tools like Microsoft Power BI, Tableau and Looker. In addition, integrations with Dataiku and KNIME have been added.

Last but not least, GDS 2.0 brings new algorithms and improvements to existing ones. Breadth-first search, depth-first search, K-nearest neighbors, delta stepping and similarity functions have now graduated to the production-quality tier, according to Neo4j.
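For readers unfamiliar with the first algorithm on that list, here is a minimal breadth-first search in plain Python. It is a textbook sketch over a hypothetical adjacency map, not the GDS implementation, and the function name `bfs_order` is invented for illustration.

```python
from collections import deque


def bfs_order(adjacency, start):
    """Return nodes in the order breadth-first search visits them."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Sort neighbors so the visiting order is deterministic.
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


# Hypothetical directed graph.
adjacency = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": {"e"},
    "e": set(),
}

order = bfs_order(adjacency, "a")
```

The search visits all nodes one hop from the start before any node two hops away, which is what distinguishes it from depth-first search.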

The big picture

Overall, GDS is getting a significant upgrade and overhaul. The launch of AuraDS brings the benefits of the cloud while driving GDS forward. Frame said that GDS saw over 370% annual growth in the number of enterprise customers and hundreds of thousands of downloads. GDS 2.0 and AuraDS bring graph data science one step closer to mainstream adoption.
