Google Cloud launches BigLake, a new cross-platform data storage engine – TechCrunch

At its Cloud Data Summit, Google today announced the preview release of BigLake, a new data lake storage engine that makes it easier for companies to analyze the data in their data warehouses and data lakes.

The idea here, essentially, is to take Google’s experience with running and managing its BigQuery data warehouse and extend it to data lakes in Google Cloud Storage, combining the best of data lakes and warehouses into one service that abstracts the underlying storage formats and systems.

It’s worth noting that this data could also be in BigQuery or live in AWS S3 and Azure Data Lake Storage Gen2. Through BigLake, developers will gain access to a consistent storage engine and the ability to query underlying data stores through a single system without moving or duplicating data.

“Managing data across disparate data lakes and warehouses creates silos and increases risk and cost, especially when data needs to be moved,” said Gerrit Kazmaier, vice president and general manager of databases, data analytics and business intelligence at Google Cloud, in today’s announcement. “BigLake enables companies to unify their data warehouses and data lakes to analyze data without worrying about the underlying storage system or format, eliminating the need to duplicate or move data from one source and reducing costs and inefficiencies.”

Image credits: Google

Using policy tags, BigLake allows administrators to configure their security policies at the table, row, and column levels. This includes data stored in Google Cloud Storage, as well as the two supported third-party systems, where BigQuery Omni, Google’s multi-cloud analytics service, enables these security controls. Those security controls also ensure that only the right data flows into tools like Spark, Presto, Trino, and TensorFlow. The service also integrates with Google’s Dataplex tool to provide additional data management capabilities.
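To make the table-, row- and column-level controls concrete, here is a hedged sketch of the kind of row-level security DDL BigQuery already supports and that BigLake extends to tables backed by object storage. The project, dataset, group, and filter predicate below are illustrative placeholders, not details from the announcement:

```python
# Illustrative sketch: building a BigQuery row-level access policy
# statement. In practice this DDL would be submitted through the
# BigQuery client; all identifiers here are hypothetical examples.

def row_access_policy(policy: str, table: str, grantee: str, predicate: str) -> str:
    """Build a CREATE ROW ACCESS POLICY statement for BigQuery."""
    return (
        f"CREATE ROW ACCESS POLICY {policy}\n"
        f"ON `{table}`\n"
        f'GRANT TO ("{grantee}")\n'
        f"FILTER USING ({predicate})"
    )

ddl = row_access_policy(
    policy="us_sales_only",
    table="my_project.sales.orders",
    grantee="group:us-analysts@example.com",
    predicate='region = "US"',
)
print(ddl)
```

A policy like this restricts which rows a given group can see, which is how the same governed table can be queried safely from engines like Spark, Presto, or Trino.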

Google notes that BigLake will provide fine-grained access controls and that its API will span Google Cloud, as well as file formats such as the open column-oriented Apache Parquet and open-source processing engines such as Apache Spark.

Image credits: Google

“The volume of valuable data that organizations have to manage and analyze is growing at an incredible rate,” explain Google Cloud Software Engineer Justin Levandoski and Product Manager Gaurav Saxena in today’s announcement. “This data is increasingly distributed across many locations, including data warehouses, data lakes, and NoSQL stores. As an organization’s data becomes more complex and proliferates across disparate data environments, silos emerge, creating increased risk and cost, especially when that data needs to be moved. Our customers have made it clear: they need help.”

In addition to BigLake, Google also announced today that Spanner, its globally distributed SQL database, will soon get a new feature called “change streams.” With these, users can easily track any changes to a database in real time, be it inserts, updates, or deletes. “This ensures customers always have access to the most up-to-date data, as they can easily replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior via Pub/Sub, or store changes in Google Cloud Storage (GCS) to comply with regulations,” Kazmaier explains.

Google Cloud today also brought Vertex AI Workbench, a tool for managing the entire lifecycle of a data science project, out of beta and into general availability, and launched Connected Sheets for Looker, as well as the ability to access Looker data models in its Data Studio business intelligence tool.
