James Dixon described the data lake:
If you think of a data mart as a store of bottled water—cleansed and packaged and structured for easy consumption—the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
A data lake is essentially a single data repository that holds all your data until it is ready for analysis, or possibly only the data that doesn’t fit into your data warehouse. Typically, a data lake stores data in its native file format, but the data may be transformed to another format to make analysis more efficient. The goal of having a data lake is to extract business or other analytic value from the data.
Data lakes can host binary data, such as images and video; unstructured data, such as PDF documents; semi-structured data, such as CSV and JSON files; and structured data, typically from relational databases. Structured data is the most directly usable for analysis, but semi-structured data can usually be imported into a structured form with little effort. Unstructured data can often be converted to structured data using intelligent automation.
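As a rough illustration of importing semi-structured data into a structured form, the sketch below flattens a few JSON records into rows with a uniform set of columns and writes them out as CSV. The sample events, the field names, and the helper functions `flatten` and `to_rows` are all hypothetical and exist only for this example; a real pipeline would more likely use a dedicated data-processing tool.

```python
import csv
import io
import json

# Hypothetical semi-structured input: JSON Lines with nested and missing fields.
RAW_EVENTS = """\
{"user": {"id": 1, "country": "CA"}, "event": "click", "ts": "2023-01-05T10:00:00Z"}
{"user": {"id": 2}, "event": "view", "ts": "2023-01-05T10:01:00Z", "page": "/home"}
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_rows(raw_text):
    """Parse JSON Lines and return (columns, rows) with one column set for all rows."""
    records = [
        flatten(json.loads(line))
        for line in raw_text.splitlines()
        if line.strip()
    ]
    # The union of all keys becomes the table's columns; missing values are None.
    columns = sorted({key for rec in records for key in rec})
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


columns, rows = to_rows(RAW_EVENTS)

# Write the structured result as CSV text (None is written as an empty cell).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The column set is derived from the data itself, so a field that appears in only some records simply produces empty cells in the others.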
Data lake vs data warehouse
The major differences between data lakes and data warehouses are:
- Data sources: Typical sources of data for data lakes include log files, clickstream data, social media posts, and data from internet-connected devices. Data warehouses typically store data extracted from transactional databases, line-of-business applications, and operational databases for analysis.
- Schema strategy: The database schema for a data lake is usually applied at analysis time, which is called schema-on-read. The database schema for an enterprise data warehouse is usually designed before the data store is created and applied to the data as it is imported, which is called schema-on-write.
- Storage infrastructure: Data warehouses often have significant amounts of expensive RAM and SSDs in order to provide query results quickly. Data lakes often use cheap spinning disks on clusters of commodity computers. Both data warehouses and data lakes use massively parallel processing (MPP) to speed up SQL queries.
- Raw vs curated data: The data in a data warehouse is supposed to be curated to the point where the data warehouse can be treated as the “single source of truth” for an organization. Data in a data lake may or may not be curated: data lakes typically start with raw data, which can later be filtered and transformed for analysis.
- Who uses it: Data warehouse users are usually business analysts. Data lake users are more often data scientists or data engineers, at least initially. Business analysts get access to the data once it has been curated.
- Type of analytics: Typical analysis for data warehouses includes business intelligence, batch reporting, and visualizations. For data lakes, typical analysis includes machine learning, predictive analytics, data discovery, and data profiling.
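The schema strategy difference in the list above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: the order records, the table layout, and the functions `load_warehouse` and `read_lake` are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import json
import sqlite3

# Hypothetical raw records; the second lacks a currency, the third has a bad amount.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.99", "currency": "USD"}',
    '{"order_id": 2, "amount": "5.00"}',
    '{"order_id": 3, "amount": "n/a", "currency": "EUR"}',
]


def load_warehouse(lines):
    """Schema-on-write: the table is defined first and enforced on import."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " amount REAL NOT NULL,"
        " currency TEXT NOT NULL)"
    )
    rejected = []
    for line in lines:
        rec = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO orders VALUES (?, ?, ?)",
                (rec["order_id"], float(rec["amount"]), rec["currency"]),
            )
        except (KeyError, ValueError, sqlite3.IntegrityError):
            # Records that do not fit the schema never reach the table.
            rejected.append(rec["order_id"])
    return conn, rejected


def read_lake(lines):
    """Schema-on-read: raw lines are kept as-is and typed only when queried."""
    for line in lines:
        rec = json.loads(line)
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            amount = None  # the reader decides how to treat bad values
        yield {
            "order_id": int(rec["order_id"]),
            "amount": amount,
            "currency": rec.get("currency"),
        }


conn, rejected = load_warehouse(RAW_LINES)
warehouse_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

lake_rows = list(read_lake(RAW_LINES))
lake_total = sum(r["amount"] for r in lake_rows if r["amount"] is not None)

print("rejected on write:", rejected)
print("rows visible on read:", len(lake_rows))
```

The trade-off shows up in the two totals: the warehouse path only sums records that passed validation on import, while the lake path keeps every record and leaves it to each reader to decide what counts as usable.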