Hadoop Ecosystem and Its Analysis on Tweets


Uzunkaya C., Ensari T., Kavurucu Y.

World Conference on Technology, Innovation and Entrepreneurship, İstanbul, Türkiye, 28-30 May 2015, pp. 1890-1897

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume:
  • DOI: 10.1016/j.sbspro.2015.06.429
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Page Numbers: pp. 1890-1897
  • İstanbul University Affiliated: Yes

Abstract

Hadoop is a Java-based programming framework for the distributed storage and processing of large data sets on commodity hardware. It was developed by the Apache Software Foundation as an open-source framework. Hadoop has two main components: the Hadoop Distributed File System (HDFS) for distributed storage and MapReduce for distributed processing. HDFS is a file system that builds on top of the existing file system and is a Java-based subproject of Apache Hadoop. It provides scalable and reliable data storage on commodity hardware. HDFS uses a master/slave architecture with a single NameNode and multiple DataNodes. The NameNode manages the file system and stores the metadata; it acts as the file manager of HDFS, since all files and directories are represented on the NameNode. The DataNodes store the actual data: a file is split into one or more blocks (64 MB or 128 MB by default), and those blocks are stored on the DataNodes.

MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce job typically splits the input data set into independent blocks, which are processed by the map tasks in a completely parallel manner; mapping the data set is the first step of the MapReduce architecture. The framework sorts the outputs of the map phase, which then become the input to the second step, the reduce task. Both the input and the output of the job are stored in a file system. The MapReduce framework consists of two processes, the JobTracker and the TaskTracker. The JobTracker manages the resources, namely the TaskTrackers. A TaskTracker is a processing node in the cluster; it accepts tasks such as map, reduce, and shuffle from the JobTracker.

Twitter4J is an unofficial Java library for the Twitter application programming interface. It integrates Java applications with all of the Twitter services.
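The block splitting described above can be illustrated with a small arithmetic sketch (not the Hadoop API; the class and method names here are hypothetical): a 200 MB file stored with the 64 MB default block size occupies four blocks, three full ones plus a final 8 MB block.

```java
// Hypothetical illustration of HDFS-style block splitting: how many
// fixed-size blocks a file of a given size occupies. No Hadoop dependency.
public class BlockSplitSketch {

    // Ceiling division: the last block may be only partially filled.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;  // 64 MB default block size
        long fileSize = 200L * 1024 * 1024;  // a 200 MB file
        System.out.println(numBlocks(fileSize, blockSize)); // prints 4
    }
}
```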
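The map/sort/reduce flow described above can be sketched in plain Java as a word count, the canonical MapReduce example. This is a minimal single-process illustration of the model, not Hadoop's actual Mapper/Reducer API: map emits (word, 1) pairs, the framework groups the pairs by key, and reduce sums each group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the MapReduce model (no Hadoop dependency):
// a word count where map emits (word, 1) pairs and reduce sums per key.
public class WordCountSketch {

    // Map phase: split one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key (TreeMap keeps keys sorted,
    // mirroring the framework's sort step) and sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = {"hadoop stores big data", "hadoop processes big data"};
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) {
            intermediate.addAll(map(line)); // in Hadoop, map tasks run in parallel
        }
        System.out.println(reduce(intermediate));
        // prints {big=2, data=2, hadoop=2, processes=1, stores=1}
    }
}
```

In a real Hadoop job the map tasks run in parallel on the DataNodes holding the input blocks, and the sorted intermediate pairs are shuffled across the cluster to the reduce tasks; this sketch collapses all of that into one process.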
This paper focuses on Hadoop and its ecosystem, and on the implementation of a Hadoop-based platform for analyzing collected tweets. The analysis results are rendered as graphical charts displayed on a web page. (C) 2015 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).