
Anonymization with MapReduce for Scalable BigData Privacy Preservation in Cloud

IJCSEC Front Page

Data privacy is one of the most important concerns when processing large datasets in Big Data applications. Big Data refers to collections of datasets so large that they are difficult to process with on-hand database management tools or traditional data processing techniques. It is commonly characterized by the three V's: Volume, Velocity and Variety. Preserving privacy for such huge datasets is a major problem, and anonymization is one technique for addressing it. Datasets such as electronic health records contain sensitive information, which raises privacy issues, especially if the information is shared publicly for data analytics. The purpose of Big Data anonymization is to protect the privacy of individuals by disclosing only aggregate information, which makes it legal to share the data without obtaining permission from each individual. However, existing privacy preservation techniques suffer from poor scalability and privacy disclosure risks. Hence, we propose a MapReduce-based approach that uses k-means clustering to partition the data and divisive hierarchical clustering to anonymize the data until the minimum privacy constraint is satisfied. Our results show that data utility is preserved even under highly restrictive privacy requirements, and that this approach significantly improves scalability and efficiency over existing anonymization techniques.
Keywords: Big Data, MapReduce, Hadoop, Data Anonymization, Privacy Preservation.
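
The two-stage idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract gives no algorithmic details, so everything here is an assumption made for illustration. The function names, the numeric quasi-identifier tuples, the median split standing in for divisive hierarchical clustering, the use of k-anonymity as the privacy constraint, and the merging of undersized partitions are all hypothetical. The map and reduce phases are plain Python functions rather than Hadoop jobs.

```python
from collections import defaultdict
from statistics import mean

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def kmeans_map(records, centroids):
    """Map phase: emit (cluster_id, record) pairs."""
    for rec in records:
        yield nearest(rec, centroids), rec

def kmeans_reduce(pairs, centroids):
    """Reduce phase: group records by cluster id and recompute each centroid."""
    groups = defaultdict(list)
    for cid, rec in pairs:
        groups[cid].append(rec)
    new_centroids = [
        tuple(mean(col) for col in zip(*groups[i])) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]
    return new_centroids, groups

def partition(records, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (at least one) and return the partitions."""
    for _ in range(iterations):
        centroids, groups = kmeans_reduce(kmeans_map(records, centroids), centroids)
    return list(groups.values())

def merge_small(partitions, k):
    """Pool partitions with fewer than k records so no partition is undersized.

    If the whole dataset has fewer than k records, k-anonymity is unattainable
    and the single pooled group is returned as-is.
    """
    big = [p for p in partitions if len(p) >= k]
    small = [r for p in partitions if len(p) < k for r in p]
    if len(small) >= k or not big:
        return big + ([small] if small else [])
    big[0] = big[0] + small
    return big

def divisive_split(group, k):
    """Bisect a group at the median of its widest attribute while both halves keep >= k records."""
    if len(group) < 2 * k:
        return [group]
    dims = range(len(group[0]))
    widest = max(dims, key=lambda d: max(r[d] for r in group) - min(r[d] for r in group))
    ordered = sorted(group, key=lambda r: r[widest])
    mid = len(ordered) // 2
    return divisive_split(ordered[:mid], k) + divisive_split(ordered[mid:], k)

def generalize(group):
    """Replace every record with the (min, max) range of each attribute in its group."""
    ranges = tuple((min(col), max(col)) for col in zip(*group))
    return [ranges] * len(group)

def anonymize(records, centroids, k):
    """Partition with k-means, split divisively, then generalize each final group."""
    out = []
    for part in merge_small(partition(records, centroids), k):
        for group in divisive_split(part, k):
            out.extend(generalize(group))
    return out

# Hypothetical (age, region code) quasi-identifiers and initial centroids.
records = [(25, 10), (27, 12), (29, 11), (31, 13),
           (60, 50), (62, 52), (64, 51), (66, 53)]
anonymized = anonymize(records, centroids=[(25, 10), (66, 53)], k=2)
```

In this sketch every generalized record is shared by at least k originals, which is the k-anonymity property. Splitting stops as soon as a further bisection would leave a group with fewer than k records, so groups stay as small, and the ranges as narrow, as the constraint allows. Sensitive attributes and the original record order are not carried through here.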

