Friday, October 7, 2011

Review: Pig Latin & Hive

Pig Latin & Hive are designed with similar goals in mind and thus share common functionality. Both execute queries/plans on Hadoop, the open-source MapReduce implementation, over data stored in HDFS. Both also support schemas for metadata. Both provide relatively simple query optimizations when compared to a standard RDBMS.

Hive is a data warehouse infrastructure built on top of Hadoop that facilitates querying and managing large datasets residing in distributed storage. Hive also defines a simple SQL-like query language, called QL. These SQL queries are compiled into MapReduce jobs to be executed as efficiently as possible. Providing this SQL-like interface is a good fit for system administrators, in my opinion, since they will already be familiar with the commands. Pig Latin, created by Yahoo!, is designed to hit the sweet spot between SQL & MapReduce. The nice thing is that Pig comes with a debugging environment for its language. On the other hand, Hive has a web interface for visualizing the various schemas and issuing queries, which would be a great help to developers.
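To make the "SQL compiled to MapReduce" idea concrete, here is a minimal sketch in plain Python (not actual HiveQL or Pig Latin, and not Hive's real compiler output) of how a declarative group-by count could be lowered into a map and a reduce function; the "visits" data and the "user" field are my own illustrative assumptions.

```python
# Sketch: lowering something like "SELECT user, COUNT(*) FROM visits GROUP BY user"
# into map/reduce functions. Table and field names are illustrative only.
from collections import defaultdict

def map_phase(rows):
    """Emit (group key, 1) for every input row."""
    for row in rows:
        yield row["user"], 1

def reduce_phase(pairs):
    """Sum the counts for each group key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

visits = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
print(reduce_phase(map_phase(visits)))   # {'alice': 2, 'bob': 1}
```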

Review: SCADS

SCADS looks at the problem of scaling today's storage backends. For example, a company may suddenly become popular, or, as with eBay during Christmas, there may be a drastic increase in the volume of queries hitting the system's backend. As stated in the paper, it promises data scale independence, and it adjusts the capacity of the system using machine learning models. With the existence of Amazon EC2, it's possible to do this. It's also interesting to note that the system gives programmers the flexibility to trade off consistency and performance. Programmers can specify the level of consistency they require (e.g. eventual consistency). Since there is no implementation yet, I'm quite unsure about the paper's impact, since many of the things that could go wrong remain undiscovered until the system has actually been built.
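To illustrate the kind of declarative performance/consistency specification the paper argues for, here is a purely hypothetical sketch in Python; the class name, its fields, and the idea of binding it to a dataset are my own assumptions, not the paper's actual API.

```python
# Hypothetical sketch of declaring performance/consistency requirements
# up front, in the spirit of SCADS; none of these names come from the paper.
from dataclasses import dataclass

@dataclass
class StorageRequirement:
    read_latency_ms: int       # upper bound on read latency
    write_latency_ms: int      # upper bound on write latency
    staleness_seconds: float   # how stale a read may be (0 = strong consistency)

# A developer states what they need; the system would then use learned models
# to add or remove machines (e.g. on EC2) so the requirement keeps being met.
user_profiles = StorageRequirement(read_latency_ms=50,
                                   write_latency_ms=100,
                                   staleness_seconds=10.0)
print(user_profiles)
```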

Review: Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks

Dryad is a general-purpose distributed execution engine for coarse-grain data-parallel applications, presented as an alternative to the MapReduce paradigm. Data flows are represented as directed acyclic graphs (DAGs). Relative to MapReduce, Dryad gives programmers more flexibility, although at the expense of exposing more complexity to them. Looking at the sample programs, I think the added complexity is not worth it, given that the programs needed end up substantial in size. I believe another layer of abstraction (e.g. the Nebula scripting language) is required. Its influence also remains in question, as this is proprietary software written by Microsoft, so its impact on the general programming community is barely seen. From the experimental results presented, it looks promising, as the speed-up is pretty much linear.
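To illustrate the DAG-of-vertices idea (this is not Dryad's actual C++ API, nor its Nebula layer), here is a tiny Python sketch where computation vertices are wired together with edges and then evaluated in dependency order; all names here are my own.

```python
# Sketch of Dryad's core idea: sequential building blocks (vertices)
# composed into a directed acyclic graph whose edges carry data.
class Vertex:
    def __init__(self, name, fn):
        self.name, self.fn, self.inputs = name, fn, []

    def __rshift__(self, downstream):
        """Add an edge self -> downstream; return downstream for chaining."""
        downstream.inputs.append(self)
        return downstream

def run(vertex):
    """Evaluate a vertex after all of its inputs (the graph has no cycles)."""
    return vertex.fn([run(v) for v in vertex.inputs])

# A three-vertex pipeline: two readers feeding one aggregator.
read_a = Vertex("read_a", lambda _: [1, 2, 3])
read_b = Vertex("read_b", lambda _: [4, 5])
total  = Vertex("sum",    lambda ins: sum(x for chunk in ins for x in chunk))
read_a >> total
read_b >> total
print(run(total))   # 15
```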

Thursday, October 6, 2011

Review: MapReduce

MapReduce is a programming model for distributed systems that is currently very popular. Users specify a map function that processes a key/value pair to generate intermediate key/value pairs, which are then passed to a reduce function that merges all the intermediate pairs sharing a key, in a way that is also specified by the user. This abstraction is very nice since users do not need to deal with the complexities of distributed systems (e.g. failure tolerance, consistency issues) and can instead focus on coding the main algorithm of the task (e.g. counting word occurrences across the web). The paper also goes on to describe how common tasks, such as Unix utilities, can be adapted to the MapReduce programming framework. However, simple as this abstraction may be, it restricts how creative programmers can be. Also, I feel that some low-level details are still exposed to programmers, such as having to specify the number of mappers they will need. I believe there are a lot of improvements still to be made to the programming framework.
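As a concrete illustration of the abstraction, here is a minimal word-count sketch in plain Python (the paper's examples use C++ against Google's internal library, so this is only a toy that follows the same map/reduce signatures):

```python
# Toy word count following the paper's signatures:
#   map(k1, v1) -> list of (k2, v2);  reduce(k2, list of v2) -> list of values.
from itertools import groupby
from operator import itemgetter

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield sum(counts)

def run_mapreduce(documents):
    # "Shuffle": sort and group intermediate pairs by key, as the framework would.
    intermediate = sorted(
        (pair for name, text in documents for pair in map_fn(name, text)),
        key=itemgetter(0))
    return {key: next(reduce_fn(key, (count for _, count in group)))
            for key, group in groupby(intermediate, key=itemgetter(0))}

print(run_mapreduce([("doc1", "to be or not to be")]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```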