Thursday, August 25, 2011

Kill Your Data Warehouse



http://ow.ly/6d09f

An article on Forbes.com by Dan Woods.

This article discusses the difference between a data warehouse and what the author describes as a data "lake".


"Here are the primary differences between a data warehouse and a data lake:
  • In a data lake, end-users are far more involved in deciding when and how the information will be distilled. The creation of the equivalent of data cubes and other summaries to speed analysis is far faster and less intermediated by experts.

  • A data lake contains many more types of data than a data warehouse, which usually has transactional records from enterprise applications. In a data lake you will find also machine data from server logs, networking equipment, telecommunications equipment, and lots of different kinds of sensors. In addition, you will find unstructured information that can be used to add context to numerical information. (See A Vision for Unifying Access to Data and Documents.)

  • A data lake will use many more techniques to correlate and understand data than a data warehouse. Capabilities like Splunk and Hadoop and other MapReduce implementations will be employed to distill and summarize machine data. Complex event processing systems will sift through many streams of data looking for patterns. Unstructured data will be analyzed and correlated to structured data using capabilities like Attivio or Autonomy.

  • A data lake will be far more oriented toward in-memory processing in real time than batch processing, which dominates the world of data warehouses."
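
To make the third point a little more concrete, here is a minimal sketch of the map/reduce style of distilling machine data into a summary. It is plain Python rather than Hadoop or Splunk, and everything in it is an assumption for illustration: the sample log lines, the "timestamp level message" layout, and the function names are hypothetical and do not come from the article.

```python
from collections import Counter
from functools import reduce

# Hypothetical server-log lines; the "timestamp level message" layout
# is an assumption made for this sketch, not a format from the article.
LOG_LINES = [
    "2011-08-25T10:00:01 INFO request served",
    "2011-08-25T10:00:02 ERROR disk full",
    "2011-08-25T10:00:03 INFO request served",
    "2011-08-25T10:00:04 WARN slow response",
    "2011-08-25T10:00:05 ERROR disk full",
]


def map_line(line):
    """Map step: emit a one-entry count keyed by the line's log level."""
    _timestamp, level, _message = line.split(" ", 2)
    return Counter({level: 1})


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def summarize(lines):
    """Map every line, then reduce the partial counts to a per-level total."""
    return dict(reduce(reduce_counts, map(map_line, lines), Counter()))


summary = summarize(LOG_LINES)
print(summary)
```

In a real MapReduce system the map and reduce steps would run in parallel across many machines; the point here is only the shape of the computation, where raw log lines go in and a small per-level summary comes out.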

1 comment:

  1. Dan's article is 99% hyperbole and 100% BS, and a damning indictment of Forbes’s dubious journalistic standards.
