Leveraging Blockchain for Federated Data Lakes: An Overview

Authored by Almaviva 

Blockchain technology, once synonymous only with cryptocurrencies, has now spread to many other sectors, including data management. This article introduces the benefits of using blockchain technology in federated data lakes for data sharing, the differences between public and private blockchains, and the concept of Layer 2 solutions, with some examples. At the end of the article, you’ll find an update on the current status of our experiments.

Blockchain and Federated Data Lakes

Federated Data Lakes

Federated data lakes are a type of data architecture that allows for the sharing and distribution of data across different organizations. Unlike traditional data lakes that store all data in a single, centralized location, federated data lakes distribute data across multiple storage systems. This allows organizations to maintain control over their own data while still making it accessible for shared use.

Blockchain Technology

Blockchain technology, on the other hand, is a type of distributed ledger technology. It creates a decentralized platform where all transactions are recorded in a secure and transparent manner. Transactions are grouped into blocks, and each block stores a cryptographic hash of the previous one, forming a chain. This makes the records effectively immutable: altering or deleting an earlier entry would change that block’s hash and break every link that follows it.
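To make the chaining idea concrete, here is a minimal Python sketch of a hash-linked list of blocks. It is an illustration only, not how any particular blockchain is implemented: the block layout and the helper names (`block_hash`, `append_block`, `is_chain_valid`) are assumptions made for this example, and there is no networking or consensus.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Return the SHA-256 hex digest of a block's canonical JSON form."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, transactions: list) -> None:
    """Append a block that stores the hash of the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append(
        {"index": len(chain), "prev_hash": prev_hash, "transactions": transactions}
    )


def is_chain_valid(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


# Hypothetical transactions, for illustration only.
chain = []
append_block(chain, ["org A shares dataset 1"])
append_block(chain, ["org B reads dataset 1"])
append_block(chain, ["org C shares dataset 2"])
print(is_chain_valid(chain))  # True

# Rewriting an earlier block breaks the link stored in the next block.
chain[1]["transactions"] = ["forged entry"]
print(is_chain_valid(chain))  # False
```

In a real network, every node holds a copy of the chain and runs a check of this kind, so a forged block would have to be accepted by the other participants as well.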

Integration of Federated Data Lakes and Blockchain

Integrating federated data lakes with blockchain technology creates a powerful tool for data sharing and collaboration, with four main benefits:

  1. Data Integrity: The immutability of blockchain records ensures that once data is added to the blockchain, it cannot be altered or deleted. This guarantees the integrity of the data, making it a reliable source of information for all participants.
  2. Transparency and Trust: The distributed nature of blockchain promotes transparency as all participants have access to the same information. This fosters trust among participants as they can verify the data independently.
  3. Security: Blockchain’s decentralized structure and cryptographic techniques provide a high level of security. To rewrite the recorded data, an attacker would need to control a majority of the network’s consensus power (hash rate in Proof of Work, stake in Proof of Stake), which is impractical on a large network.
  4. Efficient Data Sharing: The combination of federated data lakes and blockchain allows for efficient data sharing. Data can be accessed in real-time, and the use of smart contracts can automate the data-sharing process, reducing the need for manual intervention.

Public vs. Private Blockchains

Public blockchains allow open access, enabling anyone to join the network, participate in consensus processes, and validate transactions. Private blockchains, in contrast, limit access to a specific group of participants with defined roles and permissions, ensuring only authorized entities can participate.

Public blockchains are increasingly moving from resource-intensive consensus mechanisms like Proof of Work (PoW) to more scalable and energy-efficient alternatives such as Proof of Stake (PoS). PoS achieves consensus by selecting validators according to the number of tokens they lock up, or “stake”, as collateral. Private blockchains often use even more efficient consensus mechanisms like Practical Byzantine Fault Tolerance (PBFT) or Proof of Authority (often built on BFT-style protocols), which are well-suited for smaller, controlled environments and provide faster transaction confirmations.
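The two ideas in this paragraph can be sketched in a few lines of Python. The first function picks a validator with probability proportional to its stake, which is a deliberate simplification: real PoS protocols add further rules such as randomness beacons, committees, and slashing. The second gives the classic PBFT bound, under which a network of n nodes tolerates f faulty ones as long as n ≥ 3f + 1. The validator names and stake values are hypothetical.

```python
import random


def pick_validator(stakes: dict, rng: random.Random) -> str:
    """Pick one validator with probability proportional to its stake."""
    names = list(stakes)
    weights = [stakes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def max_faulty_nodes(n: int) -> int:
    """Largest f such that n >= 3f + 1 (the classic PBFT fault bound)."""
    return (n - 1) // 3


# Hypothetical stakes, for illustration only.
stakes = {"alice": 60, "bob": 30, "carol": 10}
rng = random.Random(42)  # seeded so the run is repeatable

counts = {name: 0 for name in stakes}
for _ in range(10_000):
    counts[pick_validator(stakes, rng)] += 1
print(counts)  # alice is chosen most often, carol least

print(max_faulty_nodes(4))  # 1
print(max_faulty_nodes(7))  # 2
```

The fault bound also shows why BFT-style consensus fits small, permissioned networks: four known validators are already enough to tolerate one misbehaving node.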

The main benefit of public blockchains is the large number of nodes contributing to network security and decentralization. The larger the network, the harder it is for any single entity to manipulate the blockchain. Private blockchains, with their smaller networks, may offer less security and decentralization, making them more vulnerable to collusion and tampering.

As for privacy, public blockchains store and openly share transaction data, making it accessible to all network participants. This transparency can be advantageous but also raises privacy concerns. In contrast, private blockchains can limit data visibility to specific participants, offering enhanced privacy and control over sensitive information.

Use Cases

Public blockchains are ideal for trustless environments, where participants don’t need to trust one another because anyone can verify every transaction and security is provided by the protocol itself.

Private blockchains are better suited for situations where some degree of trust already exists among participants, such as within consortiums or organizations, enabling efficient collaboration, data sharing, and streamlined processes. Users can still verify transactions, but because the network is small, they can’t be sure that no participants colluded to change something.

Making the Right Choice

The choice between public and private blockchains for a federated data lake largely depends on the specific requirements of the use case. If the data lake contains sensitive information that should only be accessible to certain entities, a private blockchain may be more suitable. However, a public blockchain may be the better choice if full transparency and trust are required.

The Best of Both Worlds: Enter Layer 2

Layer 2 refers to a secondary framework or protocol that is built on top of an existing blockchain. The primary purpose of these protocols is to solve the scalability and privacy issues that plague blockchain networks without compromising their security.

Private blockchains, used primarily by businesses and organizations for internal purposes, greatly benefit from Layer 2 solutions. These solutions enhance the privacy of transactions, a critical requirement for businesses that need to protect sensitive data. By keeping transaction data on the private network and only recording the final state on a public blockchain, Layer 2 solutions ensure that private information remains confidential.

Periodic State Anchoring

Periodic State Anchoring involves taking a snapshot of the private blockchain’s state, represented by a cryptographic hash, at specified intervals. This snapshot is then recorded on a public blockchain, ensuring a tamper-resistant record of the private blockchain’s state at various points in time. In case of suspected tampering, participants can verify the private blockchain’s integrity by recomputing the hash of its state at an anchored point and comparing it with the snapshot stored on the public blockchain. The protocol is easy to implement and cheap to run, since only a single hash is written to the public blockchain per interval.
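The anchoring cycle described above can be sketched as follows. This is a simplified illustration and not the code used in our experiments: the public chain is replaced by an in-memory stand-in (`PublicLedgerStub`), the private state is a plain dictionary of balances, and `ANCHOR_INTERVAL` and all account names are hypothetical.

```python
import hashlib
import json


def state_hash(state: dict) -> str:
    """Fingerprint a state snapshot as the SHA-256 of its canonical JSON."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class PublicLedgerStub:
    """In-memory stand-in for the public chain; maps block height to a hash."""

    def __init__(self):
        self.anchors = {}

    def anchor(self, height: int, digest: str) -> None:
        # setdefault keeps the first value: an anchor is never overwritten.
        self.anchors.setdefault(height, digest)

    def verify(self, height: int, state: dict) -> bool:
        """True only if this height was anchored and the state still matches."""
        return self.anchors.get(height) == state_hash(state)


ANCHOR_INTERVAL = 2  # hypothetical: anchor every 2 private blocks

public = PublicLedgerStub()
history = {}   # private-chain state, keyed by block height
balances = {}
transfers = [("a", 5), ("b", 3), ("a", 2), ("c", 7)]
for height, (account, amount) in enumerate(transfers, start=1):
    balances[account] = balances.get(account, 0) + amount
    history[height] = dict(balances)
    if height % ANCHOR_INTERVAL == 0:
        public.anchor(height, state_hash(history[height]))

print(public.verify(2, history[2]))  # True
history[2]["a"] = 500                # tamper with an anchored state
print(public.verify(2, history[2]))  # False
print(public.verify(3, history[3]))  # False: height 3 was never anchored
```

Only heights 2 and 4 are anchored here, so the state at height 3 cannot be checked against the public chain. Choosing the interval is a trade-off between anchoring cost and how much history is left unverifiable between anchors.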

One of the main caveats of this approach is the “a posteriori” verification of the private blockchain state. This means that the validity of the private blockchain’s state is verified after it has been anchored to the public blockchain. This can pose a problem because if a malicious or faulty transaction is included in the state that is anchored, it won’t be detected until after the fact. By the time the faulty state is discovered, the incorrect information may have already been used or acted upon, leading to potential issues or inaccuracies within the blockchain.

Zero Knowledge Rollups

Zero Knowledge Rollups offer an alternative approach to the protocol explained above, where transactions are verified before they are included in the blockchain state. This is achieved using zero-knowledge proofs, a cryptographic method that allows one party to prove to another that a statement is true without conveying any additional information.

Zero Knowledge Rollups can prevent faulty transactions from being included in the first place, thereby enhancing the integrity of the blockchain. However, they come with their own set of challenges.

The primary challenge is the complexity of zero-knowledge proofs. They are a relatively new and complex technology, and implementing them correctly requires a high level of expertise. This can make Zero Knowledge Rollups more difficult to implement than other solutions.

Additionally, the computational requirements for generating and verifying zero-knowledge proofs can be significant. This can limit the scalability improvements that Zero Knowledge Rollups can provide and may also lead to higher costs.
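To give a feel for what “proving a statement without revealing anything else” means, here is a toy version of the Schnorr identification protocol, a classic interactive zero-knowledge proof of knowledge. It is not a rollup: Zero Knowledge Rollups rely on far more elaborate non-interactive proof systems that cover whole batches of transactions. The group parameters below are deliberately tiny and insecure, and the secret value is hypothetical.

```python
import secrets

# Toy parameters (far too small for real use): G has prime order Q modulo P.
P, Q, G = 23, 11, 2


def commit():
    """Prover: pick a random nonce r and publish t = g^r mod p."""
    r = secrets.randbelow(Q)
    return r, pow(G, r, P)


def respond(r: int, challenge: int, secret: int) -> int:
    """Prover: answer the challenge with s = (r + c * x) mod q."""
    return (r + challenge * secret) % Q


def verify(public_key: int, t: int, challenge: int, s: int) -> bool:
    """Verifier: accept if g^s == t * y^c mod p; the secret x is never sent."""
    return pow(G, s, P) == (t * pow(public_key, challenge, P)) % P


secret = 7                      # x, known only to the prover
public_key = pow(G, secret, P)  # y = g^x mod p, known to everyone

r, t = commit()
challenge = secrets.randbelow(Q)  # chosen by the verifier after seeing t
s = respond(r, challenge, secret)

print(verify(public_key, t, challenge, s))            # True
print(verify(public_key, t, challenge, (s + 1) % Q))  # False
```

The verifier learns that the prover knows the secret behind `public_key`, but the values it sees (`t`, `challenge`, `s`) do not reveal the secret itself, because the random nonce `r` masks it.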

Teadal WP5: Experimenting with Periodic State Anchoring

As part of WP5, we have begun experimenting with these solutions. Initially, we deployed an Ethereum blockchain network that can function both as a public test network and a private one. This dual functionality allows us to explore the strengths and weaknesses of both types of blockchains in a controlled environment.

Following this, we implemented the concept of Periodic State Anchoring. To facilitate the verification of node statuses, we developed a front-end application. This application allows users to interact with the blockchain and verify the state of the nodes in an intuitive and user-friendly manner.

As we continue to explore and develop these innovative solutions, we are excited about the potential they hold for advancing blockchain technology. The lessons we learn from these experiments will undoubtedly play a crucial role in shaping the future of the Teadal project and the use of blockchains outside the cryptocurrency context.

A screenshot of the dashboard used to verify the private network status