Highlights:

  • The open-source LakeFS technology gets a fully managed cloud service to help organizations iterate on and version cloud data sources for development efforts.
  • LakeFS Cloud will initially be available on AWS, with support for Google Cloud and Microsoft Azure to be added in the coming months.

Treeverse recently unveiled LakeFS Cloud, a managed service that gives companies access to versioning tools for cloud data lakes. The company said it expects the service to be generally available on June 27.

A cloud data lake lets users store many kinds of data, but there is typically little to no tracking of how that data changes over time and no simple way to go back to an older version.

Treeverse started the open-source LakeFS project in 2020 to bring versioning to data lakes, similar to how the Git version control system helps developers track and build versions of application code. With the new cloud service, users no longer have to deploy and manage LakeFS themselves; the vendor deploys and manages it for them.

Treeverse competes with several rivals, such as the Dremio-run Nessie open-source project and the AWS Lake Formation service, which offers rudimentary versioning and data cataloging features. LakeFS Cloud will initially be available only on AWS; Treeverse plans to add support for Google Cloud and Microsoft Azure in the coming months.

Versioned data lakes in the medical field

Healthcare start-up Karius, based in Redwood City, California, is one of the users of the open-source LakeFS technology. Karius has created a solution combining chemistry and AI to diagnose infectious diseases without requiring invasive surgery.

“As you can imagine, such a complex technology is fueled by massive amounts of complex data that comes with every patient,” said Sivan Bercovici, CTO of Karius. “To go from what’s in the tube, to what’s in the cloud, to what’s in a physician report, the chain of custody of data needs to be secured.”

Bercovici said that in the age of data and precision medicine, many organizations have become used to the idea of never destroying any data.

Karius uses LakeFS because managing all that data becomes more challenging as complexity grows. According to Bercovici, when the company versions its essential data in LakeFS, just as it versions its code, it can count on that data being accessible and discoverable.

“LakeFS brings the much-needed focus in the clouded data space, which is the daily reality of pharma and biotech,” Bercovici said. “We went from weeks’ worth of data hunting, and anxiety around whether or not we got the right data version, to simply being able to rely on the availability of the right data, to the right data scientist, at the right time. It is liberating.”

For now, Karius self-hosts LakeFS, but it intends to move to the cloud offering in the future to ease management.

“As a rapidly growing company, we want to make sure someone who is deeply versed in the specific technology has you covered for uptime and DevOps efforts while we focus on building our differentiated value,” Bercovici said.

How does LakeFS work to version cloud data lakes?

Einat Orr, co-founder and CEO of Treeverse, said that LakeFS aims to let enterprises apply the engineering best practices used in code development to their data lakes.

Those practices include keeping several versions, or branches, of the data and letting users work with any branch. The system also supports reversion, allowing users to return to an earlier version if an update turns out to be a problem. Another fundamental function that LakeFS offers is the ability to merge branches.
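
To make the branch, revert, and merge ideas concrete, here is a minimal Python sketch of a toy in-memory model. It is not the LakeFS API or its implementation; the names ToyRepo, Commit, branch, commit, revert, merge, and read are hypothetical and chosen only for illustration, and the merge rule (the source branch wins on conflict) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Commit:
    """A snapshot of the lake: object path -> content identifier."""
    snapshot: Dict[str, str]
    message: str
    parent: Optional["Commit"] = None


class ToyRepo:
    """Hypothetical toy model of Git-like versioning for a data lake."""

    def __init__(self) -> None:
        # Every branch is a named pointer to a commit.
        self.branches: Dict[str, Commit] = {"main": Commit({}, "initial")}

    def branch(self, name: str, source: str = "main") -> None:
        # Creating a branch copies a pointer, not the data itself.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, Optional[str]], message: str) -> Commit:
        head = self.branches[branch]
        snapshot = dict(head.snapshot)
        for path, content in changes.items():
            if content is None:
                snapshot.pop(path, None)  # None marks a deletion
            else:
                snapshot[path] = content
        new_head = Commit(snapshot, message, parent=head)
        self.branches[branch] = new_head
        return new_head

    def revert(self, branch: str) -> None:
        # Move the branch pointer back to the previous commit.
        head = self.branches[branch]
        if head.parent is None:
            raise ValueError("nothing to revert")
        self.branches[branch] = head.parent

    def merge(self, source: str, dest: str) -> Commit:
        # Simplifying assumption: on conflict, the source branch wins.
        merged = dict(self.branches[dest].snapshot)
        merged.update(self.branches[source].snapshot)
        new_head = Commit(merged, f"merge {source} into {dest}", parent=self.branches[dest])
        self.branches[dest] = new_head
        return new_head

    def read(self, branch: str, path: str) -> Optional[str]:
        return self.branches[branch].snapshot.get(path)


repo = ToyRepo()
repo.commit("main", {"samples/a.csv": "v1"}, "load raw samples")
repo.branch("experiment")                       # isolated workspace
repo.commit("experiment", {"samples/a.csv": "v2"}, "clean samples")
print(repo.read("main", "samples/a.csv"))       # main is untouched by the experiment
repo.merge("experiment", "main")                # publish the cleaned data
print(repo.read("main", "samples/a.csv"))
repo.revert("main")                             # roll back if the update was bad
print(repo.read("main", "samples/a.csv"))
```

In this sketch, work on the experiment branch never changes what readers of main see until the merge, and the revert restores main to its pre-merge state.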

The open-source LakeFS technology requires only a server, a database, and access to storage. Although users can and do set it up on their own in a self-hosted deployment, keeping that deployment in good shape can be difficult and time-consuming.

This is where LakeFS Cloud steps in, as a managed service that takes care of users’ LakeFS deployment and operation needs. In the AWS deployment, the gateway component of LakeFS Cloud lets organizations securely connect to and access their data lake using AWS PrivateLink.

According to Orr, the ability to version data in a data lake can help with development efforts and with data quality issues, which can be challenging to troubleshoot.

“The moment you have the quality of your data questioned, the process is manual, difficult, and hard to manage, and this is where the value of LakeFS shines,” Orr said. “LakeFS allows reproducibility and reversion capabilities, and it can support working in isolation for development and debugging.”