Large-scale News Image Analysis with MapReduce-based LSH and VisualRank

Hao Li (a Ph.D. student in CS) and I carried out a big-data analysis project on the MapReduce framework (Hadoop) as the final project for INFM718G (Data-Intensive Computing with MapReduce, taught by Dr. Jimmy Lin). Working with all the news images from April 2013, we set out to rank each image by its importance and popularity. To do that, we extracted image features using SIFT (scale-invariant feature transform) and constructed a graph of images, using LSH (locality-sensitive hashing) to approximate the similarity between images. In the graph, nodes were image IDs, and the weight of an edge was the similarity between the two images it connected. We then ranked all the images by importance (or popularity) using the VisualRank algorithm, an image version of PageRank. The results could potentially complement existing search algorithms in finding "important" news articles. Hao Li took care of extracting image features and constructing the graph; I implemented the VisualRank algorithm on Hadoop and visualized the graph using NodeXL.
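To make the ranking step concrete, here is a minimal single-machine sketch of VisualRank as a PageRank-style power iteration over a weighted similarity graph. It is not the Hadoop implementation from the project: the function name `visual_rank`, the damping factor of 0.85, the fixed iteration count, the uniform handling of isolated images, and the toy image IDs and weights are all assumptions made for illustration.

```python
def visual_rank(similarity, damping=0.85, iterations=50):
    """Rank images on a weighted similarity graph (illustrative sketch).

    similarity: dict mapping image ID -> {neighbor ID: edge weight}.
    Every neighbor ID must also appear as a key. The damping factor and
    iteration count are assumed values, not taken from the project.
    """
    nodes = sorted(similarity)
    n = len(nodes)
    # Total edge weight leaving each image, used to normalize its votes.
    out_weight = {u: sum(similarity[u].values()) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}

    for _ in range(iterations):
        # Every image starts each round with the uniform "teleport" share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        dangling = 0.0
        for u in nodes:
            if out_weight[u] == 0:
                # Isolated image: its score is redistributed uniformly below.
                dangling += rank[u]
                continue
            for v, w in similarity[u].items():
                # Pass score to neighbors in proportion to similarity.
                new_rank[v] += damping * rank[u] * w / out_weight[u]
        for u in nodes:
            new_rank[u] += damping * dangling / n
        rank = new_rank
    return rank


# Hypothetical toy graph: three mutually similar images and one isolated image.
graph = {
    "img_a": {"img_b": 0.9, "img_c": 0.8},
    "img_b": {"img_a": 0.9, "img_c": 0.7},
    "img_c": {"img_a": 0.8, "img_b": 0.7},
    "img_d": {},
}
ranks = visual_rank(graph)
for image_id in sorted(ranks, key=ranks.get, reverse=True):
    print(image_id, round(ranks[image_id], 4))
```

In a MapReduce setting, each iteration of the outer loop would typically become one job: mappers emit the weighted contributions along edges and reducers sum them per image. That mapping is a general pattern for PageRank-style algorithms and is only assumed to resemble what the project did.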