May 26, 2022

At its Cloud Data Summit, Google today announced the preview of BigLake, a new data lake storage engine that makes it easier for enterprises to analyze the data in their data warehouses and data lakes.

The idea here, essentially, is to take Google’s experience running and managing its BigQuery data warehouse and extend it to data lakes on Google Cloud Storage, combining the best of data lakes and warehouses into a single service that abstracts away the underlying storage format and system.

It’s worth noting that this data can sit in BigQuery or live on AWS S3 and Azure Data Lake Storage Gen2. With BigLake, developers get access to a uniform storage engine and the ability to query the underlying data stores through a single system without having to move or duplicate data.
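For a concrete sense of the model, here is a minimal sketch of how a developer might define a BigLake-style external table over Parquet files in Cloud Storage and then query it like a native table. It uses the google-cloud-bigquery Python client; the project, dataset, bucket, and connection names are hypothetical placeholders, and the DDL reflects the preview-era syntax, so details may differ in practice.

```python
# Sketch: defining a BigLake table over Parquet files in Cloud Storage.
# Assumes the google-cloud-bigquery client library and a pre-created
# cloud resource connection; all project, dataset, bucket, and
# connection names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE EXTERNAL TABLE `my-project.my_dataset.orders`
WITH CONNECTION `my-project.us.my-gcs-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-bucket/orders/*.parquet']
)
"""

client.query(ddl).result()  # run the DDL and wait for completion

# The table can now be queried like any native BigQuery table,
# without copying the Parquet files out of Cloud Storage.
rows = client.query(
    "SELECT COUNT(*) AS n FROM `my-project.my_dataset.orders`"
).result()
for row in rows:
    print(row.n)
```

The point of the design is that the files stay where they are; only the table definition and access policies live in the warehouse layer.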

“Managing data across disparate lakes and warehouses creates silos and increases risk and cost, especially when data needs to be moved,” explains Gerrit Kazmaier, VP and General Manager of Databases, Data Analytics and Business Intelligence at Google Cloud, in today’s announcement. “BigLake allows companies to unify their data warehouses and lakes to analyze data without worrying about the underlying storage format or system, which eliminates the need to duplicate or move data from a source and reduces cost and inefficiencies.”

Image Credits: Google

Using policy tags, BigLake allows administrators to configure their security policies at the table, row, and column level. This includes data stored in Google Cloud Storage, as well as the two supported third-party systems, where BigQuery Omni, Google’s multi-cloud analytics service, enables these security controls. Those controls also ensure that only the right data flows into tools like Spark, Presto, Trino, and TensorFlow. The service also integrates with Google’s Dataplex tool to provide additional data management capabilities.
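Row-level control of the kind described above can be expressed with BigQuery row access policies; the sketch below assumes the google-cloud-bigquery client, and the table, group, and column names are hypothetical. (Column-level controls via policy tags are configured separately, through Data Catalog taxonomies.)

```python
# Sketch: restricting a table so a given group only sees US rows.
# Assumes the google-cloud-bigquery client library; the project,
# dataset, table, group, and column names are all placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE ROW ACCESS POLICY us_only
ON `my-project.my_dataset.orders`
GRANT TO ('group:us-analysts@example.com')
FILTER USING (country = 'US')
"""

client.query(ddl).result()  # members of the group now see only US rows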

Google notes that BigLake will provide fine-grained access controls and that its API will span Google Cloud, as well as file formats like the open, column-oriented Apache Parquet and open-source processing engines like Apache Spark.
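In practice, that engine support means a Spark job could read the same governed table without first copying the data out. The PySpark sketch below uses the open-source spark-bigquery connector; the connector coordinates and table name are illustrative assumptions, not taken from the announcement.

```python
# Sketch: reading a BigQuery/BigLake table from Spark via the
# open-source spark-bigquery connector. The connector version and
# the table name are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("biglake-read")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.24.2",
    )
    .getOrCreate()
)

# Load the table as a DataFrame; the access policies applied in
# BigQuery govern which rows and columns the job can see.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.orders")
    .load()
)

df.groupBy("country").count().show()
```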

Image Credits: Google

“The volume of valuable data that organizations have to manage and analyze is growing at an incredible rate,” Google Cloud software engineer Justin Levandoski and product manager Gaurav Saxena explained in today’s announcement. “This data is increasingly distributed across many locations, including data warehouses, data lakes, and NoSQL stores. As an organization’s data gets more complex and proliferates across disparate data environments, silos emerge, increasing risk and cost, especially when that data needs to be moved. Our customers have made it clear; they need help.”

In addition to BigLake, Google also announced today that Spanner, its globally distributed SQL database, will soon get a new feature called “change streams.” With these, users can easily track changes to a database in real time, be those inserts, updates, or deletes. “This ensures customers always have access to the freshest data as they can easily replicate changes from Spanner to BigQuery for real-time analytics, trigger downstream application behavior using Pub/Sub, or store changes in Google Cloud Storage (GCS) for compliance,” said Kazmaier.
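A change stream is declared as a schema object on the database itself. As a rough sketch of what that might look like with the google-cloud-spanner Python client (the instance, database, table, and stream names here are hypothetical, and the feature was only just announced at the time of writing):

```python
# Sketch: creating a Spanner change stream that tracks inserts,
# updates, and deletes on one table. Assumes the google-cloud-spanner
# client library; all resource names are placeholders.
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("my-database")

op = database.update_ddl(
    ["CREATE CHANGE STREAM OrdersStream FOR Orders"]
)
op.result()  # wait for the schema change to complete
```

Downstream consumers could then replicate those change records to BigQuery, publish them via Pub/Sub, or archive them to GCS, as the announcement describes.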

Google Cloud also today moved Vertex AI Workbench, a tool for managing the entire lifecycle of a data science project, out of beta and into general availability, and launched Connected Sheets for Looker, along with the ability to access Looker data models in its Data Studio BI tool.
