Researchers at Brown University and MIT have developed a new data science framework that lets users process data with the Python programming language without paying the “performance tax” normally associated with a user-friendly language.
The new framework, called Tuplex, can process data queries written in Python up to 90 times faster than industry-standard data systems like Apache Spark and Dask. The research team introduced the system in a study presented at SIGMOD 2021, a leading data processing conference, and has made the software freely available to everyone.
“Python is the leading programming language used by data science practitioners,” says Malte Schwarzkopf, assistant professor of computer science at Brown and one of the developers of Tuplex. “That makes a lot of sense. Python is widely taught in colleges, and it is an easy language to get started with. But when it comes to data science, there is a high performance tax associated with Python, because platforms cannot process Python efficiently on the back end.”
Platforms such as Spark perform data analytics by distributing tasks across multiple processor cores or machines in a data center. This parallel processing lets users work with huge data sets that would overwhelm a single computer. Users interact with these platforms by entering their own queries, which contain custom logic written as “user-defined functions,” or UDFs. A UDF specifies custom logic, for example extracting the number of bedrooms from the text of a real estate listing, for a query that searches all real estate listings in the United States and selects those with three bedrooms.
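To make the idea of a UDF concrete, here is a minimal sketch in plain Python of the bedroom example described above. It does not use Spark, Dask or Tuplex. The sample listings, the regular expression and the function names `extract_bedrooms` and `three_bedroom_listings` are assumptions made for illustration, not code from the study.

```python
import re

# Hypothetical sample listings, invented for this illustration.
LISTINGS = [
    "Sunny 3 bedroom condo near downtown",
    "Cozy studio with river view",
    "Spacious 4 bedroom house with garage",
    "Renovated 3 bedroom townhouse",
]

# Assumed text format: a numeral followed by the word "bedroom".
BEDROOM_PATTERN = re.compile(r"(\d+)\s*bedroom", re.IGNORECASE)


def extract_bedrooms(listing_text):
    """UDF: return the bedroom count found in the text, or None if absent."""
    match = BEDROOM_PATTERN.search(listing_text)
    return int(match.group(1)) if match else None


def three_bedroom_listings(listings):
    """Query: keep only listings whose extracted bedroom count is 3."""
    return [text for text in listings if extract_bedrooms(text) == 3]


for text in three_bedroom_listings(LISTINGS):
    print(text)
```

On a real analytics platform, the query engine would apply a function like `extract_bedrooms` to every record in parallel. The function body is the part the user writes in Python.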
Because of its simplicity, Python is the language of choice for creating UDFs in the data science community. In fact, the Tuplex team cites a recent survey showing that 66% of data platform users use Python as their primary language. The problem is that analytics platforms have trouble handling those bits of Python code efficiently.
Data platforms are written in high-level computer languages that are compiled before they run. A compiler is a program that takes code written in a computer language and converts it into machine code, a set of instructions that a computer processor can execute quickly. Python, however, is not compiled beforehand. Instead, the computer interprets Python code line by line while the program is running, which can significantly reduce performance.
“These frameworks have to break out of their efficient execution of compiled code and jump into a Python interpreter to run Python UDFs,” said Schwarzkopf. “That process can be 100 times less efficient than running compiled code.”
If Python code could be compiled, it would run much faster. However, researchers have been trying to develop a general-purpose Python compiler for years with little success. So instead of writing a general Python compiler, the researchers designed Tuplex to compile a highly specialized program for the specific query and the common-case input data. Uncommon input data, which account for only a small share of instances, are separated out and referred to an interpreter.
“We call this process dual-case processing because it divides the data into two cases,” said Leonhard Spiegelberg, co-author of the study describing Tuplex. “This simplifies the compilation problem because we only have to worry about a single set of data types and common-case assumptions. This way, you get the best of two worlds: high productivity and fast execution speed.”
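The routing idea behind dual-case processing can be sketched in a few lines of plain Python. This is only an illustration of the split described above, not how Tuplex is implemented: in the real system the common-case path is compiled code, whereas here both paths are ordinary Python functions. The sample prices, the choice of "integer price" as the common case and the function names are all assumptions.

```python
def fast_path(price):
    """Specialized path: assumes the price is already an int (common case)."""
    return price // 1000  # price in thousands


def general_path(price):
    """Fallback path: tolerates strings such as "$610,000" and floats."""
    if isinstance(price, str):
        price = price.replace("$", "").replace(",", "")
    return int(float(price)) // 1000


def process(rows):
    """Route each row to the fast path or the fallback, keeping input order."""
    results = []
    fallback_count = 0
    for row in rows:
        if type(row) is int:
            results.append(fast_path(row))
        else:
            fallback_count += 1
            results.append(general_path(row))
    return results, fallback_count


# Hypothetical input: mostly ints, with a few uncommon representations.
rows = [450000, 325000, "$610,000", 289500.0, 720000]
results, fallback_count = process(rows)
print(results, fallback_count)
```

Because the fast path only ever sees one data type, it can stay simple. The flexible but slower logic is confined to the small share of rows that need it.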
The runtime advantages can also be considerable.
“Our research shows that we can reduce the waiting time for output to 10 minutes,” says Schwarzkopf. “So it’s really a significant increase in performance.”
The researchers say Tuplex not only speeds things up but also offers an innovative way of dealing with anomalous data. Large data sets are often cluttered with corrupted records or data fields that do not follow convention. In real estate data, for example, the number of bedrooms might be given either as a numeral or as a written-out number. Inconsistencies like that can be enough to crash some data platforms. Tuplex, however, extracts those anomalies and sets them aside to avoid a crash. Once the program has run, the user can go back and repair them.
“We believe this can have a significant impact on the productivity of data scientists,” says Schwarzkopf. “It really makes a difference if you don’t have to go get a cup of coffee while you wait for an output, or have a program run for an hour only to crash before it’s done.”
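The set-aside-and-repair approach to anomalous records can be illustrated with a short plain-Python sketch. It does not use the Tuplex API. The sample fields, the word-to-number table and the function names `run_query` and `repair_set_aside` are assumptions chosen for illustration.

```python
# Assumed lookup table for written-out bedroom counts.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def parse_bedrooms(field):
    """UDF that expects a numeral; raises ValueError on anything else."""
    return int(field)


def run_query(fields, udf):
    """Apply the UDF to every field; set failing records aside instead of crashing."""
    results = {}
    set_aside = []
    for index, field in enumerate(fields):
        try:
            results[index] = udf(field)
        except ValueError:
            set_aside.append((index, field))
    return results, set_aside


def repair_set_aside(set_aside, fixer):
    """After the run, try a user-supplied fix on each record that was set aside."""
    repaired = {}
    unresolved = []
    for index, field in set_aside:
        try:
            repaired[index] = fixer(field)
        except (KeyError, ValueError):
            unresolved.append((index, field))
    return repaired, unresolved


# Hypothetical bedroom fields: numerals, written-out numbers and one bad value.
fields = ["3", "2", "three", "4", "Two", "n/a"]
results, set_aside = run_query(fields, parse_bedrooms)
repaired, unresolved = repair_set_aside(
    set_aside, lambda field: WORD_TO_NUMBER[field.strip().lower()]
)
print(results, repaired, unresolved)
```

The first pass finishes even though half the fields are malformed. The written-out numbers are recovered in the repair pass, and the one value that cannot be interpreted is reported rather than silently dropped.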
Paper: cs.brown.edu/people/malte/pub/… 21-sigmod-tuplex.pdf
Citation: New data science platform speeds up Python queries (2021, July 1), retrieved July 6, 2021 from https://techxplore.com/news/2021-07-science-platform-python-queries.html