Data Lakehouse: The Good, The Bad, and the Ugly

There has been a lot of noise surrounding the data lakehouse lately, so I felt the urge to chime in.

According to Databricks, which is credited with the term, a data lakehouse is "a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management and ACID transactions of data warehouses, enabling business intelligence (BI) and machine learning (ML) on all data." In other words, it's a hybrid between a relational data warehouse and a data lake.

In fact, the famous Guy in a Cube, Patrick LeBlanc, gave a great presentation on this subject to our Atlanta Power BI Group and you can find the recording here (I have to admit we could have done a better job with the recording quality, but we are still learning in the post-COVID era).

Visualizing this in Microsoft parlance, the latest incarnation of the lakehouse architecture that I came across looks like this:

[Image: lakehouse architecture diagram]

I'm sure that many large companies, or companies with complex data integration needs, could benefit from this architecture. As I have said many times, staging data to a lake is a good thing when you must deal with files.

When talking about new open source storage formats like Delta Lake, I have been in several conversations with old-school data warehouse people who don't seem to know or acknowledge that there are files (e.g. MDF and LDF files on MS SQL Server) with actual physical I/O behind these database engines. The conversation never delves into the fact that sometimes the server must and will decide to allocate more files or pages as necessitated by CRUD operations (or even index optimizations). Sure, you don't have to know those details to run a SQL query or insert/update/delete/merge data, but one should not make it sound like data warehouse persistence is free and old-school databases didn't have to worry about file I/O! Of course they did.

It was a huge war among the DBMS vendors: Sybase, and then MS SQL Server, went from page locks to row locks. This and many other techniques reduced file I/O, limited contention and improved read/write isolation!

Whether Delta uses file snapshots or deletion vectors, the interface is an atomic write with isolated reads/writes (yes, fully ACID). You do not need to know those intricacies because you simply need to run your SQL statements.

AND, all the while (i.e. concurrently), you can run your ML libraries against the Delta Lake backed dataframes (sourced by table or file location)! So Delta supports linear algebra and vector/array interfaces that SQL result sets and cursors do not!

So, hello Lakehouse, goodbye landline! Your smartphone is reliable and fast enough to turn off your father's warehouse, all the while doing much more for far less!
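To make the "files plus an atomic commit" idea a bit more concrete, here is a minimal toy sketch in Python. It is an illustration of the general technique only (data files published through a numbered commit log), not Delta Lake's actual protocol or API. The names `ToyLogTable`, `append`, `snapshot`, the `_log` directory and the JSON layout are all made up for this example.

```python
import json
import os
import tempfile
import uuid


class ToyLogTable:
    """Toy table: JSON data files plus a numbered commit log (hypothetical layout)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _log_path(self, version):
        return os.path.join(self.log_dir, f"{version:08d}.json")

    def latest_version(self):
        """Highest committed version, or -1 if nothing has been committed."""
        names = [n for n in os.listdir(self.log_dir) if n.endswith(".json")]
        return max((int(n[:-5]) for n in names), default=-1)

    def snapshot(self, version=None):
        """Return the rows visible at `version` (default: latest commit)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in range(version + 1):
            with open(self._log_path(v)) as f:
                commit = json.load(f)
            for name in commit["add"]:
                with open(os.path.join(self.root, name)) as f:
                    rows.extend(json.load(f))
        return rows

    def append(self, rows, read_version=None):
        """Stage a data file, then publish it with one atomic log commit.

        Raises FileExistsError if another writer already took the next version.
        """
        if read_version is None:
            read_version = self.latest_version()
        # Step 1: write the data file; readers cannot see it until it is committed.
        data_name = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_name), "w") as f:
            json.dump(rows, f)
        # Step 2: write the commit entry to a temp file, then hard-link it into place.
        tmp_path = os.path.join(self.log_dir, f"{uuid.uuid4().hex}.tmp")
        with open(tmp_path, "w") as f:
            json.dump({"add": [data_name]}, f)
        try:
            # os.link fails if the target exists, so only one writer wins a version.
            os.link(tmp_path, self._log_path(read_version + 1))
        finally:
            os.remove(tmp_path)
        return read_version + 1


def demo():
    with tempfile.TemporaryDirectory() as root:
        table = ToyLogTable(root)
        v0 = table.append([{"id": 1}])
        pinned = table.latest_version()      # a reader pins its snapshot here
        v1 = table.append([{"id": 2}])       # a later write does not disturb it
        before = table.snapshot(pinned)
        after = table.snapshot()
        try:
            table.append([{"id": 3}], read_version=v0)  # stale writer loses
            conflict = False
        except FileExistsError:
            conflict = True
        return v0, v1, before, after, conflict


if __name__ == "__main__":
    print(demo())
```

In this toy, a reader that pinned version 0 keeps seeing only the first row after a second commit lands, and a writer working from a stale version gets an error instead of silently overwriting. That is the gist of "atomic write with isolated reads", even though a real engine adds much more (schema, deletes, checkpoints, retries).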