HADOOP-IEEE PROJECTS
1. A Big Data Clustering Algorithm for Mitigating the Risk of Customer Churn
2. A Big Data Scale Algorithm for Optimal Scheduling of Integrated Microgrids
The capability of switching into islanded operation mode has been advocated as a viable way for microgrids to achieve high system reliability. This paper proposes a new model for the microgrid optimal scheduling and load curtailment problem. The problem determines the optimal schedule for the local generators of microgrids so as to minimize the generation cost of the associated distribution system in normal operation. Moreover, when microgrids have to switch into islanded operation mode for reliability reasons, the optimal generation solution still guarantees a minimal amount of load curtailment. Owing to the large number of constraints in both normal and islanded operation, the formulated problem becomes a large-scale optimization problem that is very challenging to solve with a centralized computational method. We therefore propose a decomposition algorithm based on the alternating direction method of multipliers (ADMM) that provides a parallel computational framework. Simulation results demonstrate the efficiency of the proposed model in reducing generation cost as well as guaranteeing reliable operation of microgrids in the islanded mode. Finally, we describe a detailed parallel implementation of the proposed algorithm on a computer cluster using the Hadoop MapReduce software framework.
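To make the parallel structure of ADMM concrete, here is a minimal consensus-ADMM sketch in NumPy. It is not the paper's microgrid formulation: generic least-squares subproblems with synthetic data stand in for the per-microgrid dispatch subproblems, and only the averaging step requires global communication.

```python
# Minimal consensus-ADMM sketch; subproblems and data are illustrative,
# not the paper's microgrid model.
import numpy as np

rng = np.random.default_rng(0)
n, m, agents, rho = 5, 20, 4, 1.0

# Each "agent" i holds private data (A_i, b_i) and solves its own
# subproblem; in a MapReduce setting these x-updates run in parallel.
A = [rng.standard_normal((m, n)) for _ in range(agents)]
b = [rng.standard_normal(m) for _ in range(agents)]

x = [np.zeros(n) for _ in range(agents)]
u = [np.zeros(n) for _ in range(agents)]
z = np.zeros(n)

for _ in range(100):
    # x-update: each agent solves a small regularized least-squares problem.
    for i in range(agents):
        x[i] = np.linalg.solve(A[i].T @ A[i] + rho * np.eye(n),
                               A[i].T @ b[i] + rho * (z - u[i]))
    # z-update: averaging step (the only global communication).
    z = np.mean([x[i] + u[i] for i in range(agents)], axis=0)
    # Dual update.
    for i in range(agents):
        u[i] += x[i] - z

print("consensus residual:", max(np.linalg.norm(x[i] - z) for i in range(agents)))
```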
3. A Cloud Service Architecture for Analyzing Big Monitoring Data
Cloud monitoring is a source of big data that is constantly produced from traces of infrastructures, platforms, and applications. Analysis of monitoring data delivers insights into the system's workload and usage patterns and ensures that workloads are operating at optimum levels. The analysis process involves data query and extraction, data analysis, and result visualization. Since the volume of monitoring data is large, these operations require a scalable and reliable architecture to extract, aggregate, and analyze data at an arbitrary range of granularity. Ultimately, the results of analysis become knowledge of the system and should be shared and communicated. This paper presents our cloud service architecture, which exploits a search cluster for data indexing and query. We develop REST APIs through which the data can be accessed by different analysis modules. The architecture can be extended to integrate with big data software frameworks for both batch processing (such as Hadoop) and stream processing (such as Spark). The analysis results are structured in Semantic MediaWiki pages in the context of the monitoring data source and the analysis process. The architecture is empirically assessed to evaluate its responsiveness when processing a large set of data records under node failures.
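As a hedged illustration of the access pattern, the sketch below shows how an analysis module might pull monitoring records over REST. The base URL, query parameters, and JSON response shape are all hypothetical; the paper specifies only that a search cluster is queried through REST APIs.

```python
# Hypothetical REST query from an analysis module; endpoint and response
# shape are assumptions, not the paper's actual API.
import requests

BASE = "http://monitoring.example.com/api/v1"  # hypothetical service URL

resp = requests.get(
    f"{BASE}/metrics",
    params={"host": "node-17", "metric": "cpu_util",
            "from": "2016-01-01T00:00:00Z", "to": "2016-01-02T00:00:00Z",
            "granularity": "5m"},              # arbitrary time range/rollup
    timeout=30,
)
resp.raise_for_status()
for point in resp.json().get("datapoints", []):  # assumed response shape
    print(point["timestamp"], point["value"])
```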
4. A comparative study of various clustering techniques on big data sets using Apache Mahout
Clustering algorithms have emerged as an effective tool for precisely examining the immense volumes of data produced by present-day applications. Specifically, their main objective is to partition data into clusters such that objects within the same cluster are similar according to particular metrics and dissimilar to objects in other clusters. From the machine learning perspective, clustering can be viewed as unsupervised learning of concepts. Hadoop is an open-source framework providing a distributed file system and an implementation of MapReduce for dealing with big data, and the Apache Mahout clustering algorithms are implemented on top of Hadoop using the MapReduce paradigm. This paper describes and compares three clustering algorithms implemented in Apache Mahout: K-Means, Fuzzy K-Means (FKM), and Canopy clustering. In addition, we highlight the clustering algorithms that perform best on big data.
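Of the three, Canopy clustering is the least widely known, so here is a plain Python/NumPy sketch of the standard canopy algorithm (a cheap distance measure with two thresholds T1 > T2). This mirrors the textbook algorithm, not Mahout's MapReduce implementation; in Mahout the resulting canopies typically seed a subsequent K-Means run.

```python
# Canopy clustering sketch (standard algorithm, not Mahout's code).
import numpy as np

def canopy(points, t1, t2):
    assert t1 > t2, "loose threshold T1 must exceed tight threshold T2"
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = points[remaining[0]]
        dists = np.linalg.norm(points[remaining] - center, axis=1)
        # Points within T1 join the canopy; points within T2 are removed
        # from further consideration as future canopy centers.
        members = [i for i, d in zip(remaining, dists) if d < t1]
        remaining = [i for i, d in zip(remaining, dists) if d >= t2]
        canopies.append((center, members))
    return canopies

pts = np.random.default_rng(1).standard_normal((200, 2))
for center, members in canopy(pts, t1=1.5, t2=0.7):
    print(len(members), "points near", np.round(center, 2))
```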
5. A Parallel Patient Treatment Time Prediction Algorithm and Its Applications in Hospital Queuing-Recommendation in a Big Data Environment
Effective patient queue management to minimize patient wait times and overcrowding is one of the major challenges faced by hospitals. Unnecessary waits for long periods result in substantial wastage of human resources and time and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all the patients ahead of him is the time he must wait. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. We therefore propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time of each treatment task for a patient, together with a Hospital Queuing-Recommendation (HQR) system built on it. The PTTP model is trained on large-scale, realistic patient data from various hospitals using an improved Random Forest (RF) algorithm for each treatment task, and the waiting time of each task is predicted from the trained model. Based on the predicted waiting times, HQR calculates and recommends an efficient and convenient treatment plan for each patient, which the patient can view, along with the predicted waiting time, in real time through a mobile application. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system demand efficiency and low latency, so we employ big data and cloud computing models and use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha. Extensive experimentation and application results show that the PTTP algorithm achieves high precision and performance, and that the system recommends effective treatment plans that minimize patients' wait times in hospitals.
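In the spirit of the PTTP training step, the following PySpark sketch fits a per-task random forest regressor for treatment time. The HDFS path and column names (age, hour, weekday, task_id, treatment_minutes) are hypothetical, and the stock RandomForestRegressor stands in for the paper's improved RF algorithm.

```python
# Hedged PySpark sketch of training a waiting-time model; schema and path
# are assumptions, and this uses the stock MLlib random forest rather
# than the paper's improved RF.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("pttp-sketch").getOrCreate()

df = spark.read.csv("hdfs:///hospital/treatment_records.csv",  # hypothetical
                    header=True, inferSchema=True)

# Assemble assumed features: patient age, hour of day, day of week, task id.
assembler = VectorAssembler(
    inputCols=["age", "hour", "weekday", "task_id"], outputCol="features")
train = assembler.transform(df)

rf = RandomForestRegressor(featuresCol="features",
                           labelCol="treatment_minutes", numTrees=100)
model = rf.fit(train)

# Predicted per-task times would then be summed over the queue ahead of a
# patient to estimate that patient's waiting time.
model.transform(train).select("task_id", "prediction").show(5)
```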
6. A Tutorial on Secure Outsourcing of Large-scale Computations for Big Data
Today’s society is collecting a massive and exponentially growing amount of data that can potentially revolutionize scientific and engineering fields and promote business innovation. With the advent of cloud computing, users can analyze data in a cost-effective and practical way by outsourcing their computing tasks to the cloud, which offers access to vast computing resources on an on-demand, pay-per-use basis. However, since users’ data may contain sensitive information that must be kept secret for ethical, security, or legal reasons, many users are reluctant to adopt cloud computing. To this end, researchers have proposed techniques that enable users to offload computations to the cloud while protecting their data privacy. In this paper, we review recent advances in the secure outsourcing of large-scale computations for big data analysis. We first introduce the two most fundamental and common classes of computational problems, linear algebra and optimization, and then provide an extensive review of privacy-preserving techniques. After that, we explain how researchers have exploited these privacy-preserving techniques to construct secure outsourcing algorithms for large-scale computations.
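As a toy instance of the general idea, the sketch below masks the inputs of a matrix multiplication with random invertible diagonal matrices before sending them to the cloud, then unmasks the returned product with two cheap diagonal scalings. This is a well-known lightweight disguise technique shown purely for illustration (it hides values, not structure, and is secure only against a weak adversary); it is not a specific protocol from the tutorial.

```python
# Toy diagonal-masking disguise for outsourced matrix multiplication.
# Illustrative only; not a protocol from the surveyed literature.
import numpy as np

rng = np.random.default_rng(2)
n = 4
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))

# Client side: mask with nonzero random diagonals D1, D2, D3.
d1, d2, d3 = (rng.uniform(1, 2, n) for _ in range(3))
A_masked = np.diag(d1) @ A @ np.diag(d2)
B_masked = np.diag(1 / d2) @ B @ np.diag(d3)

# Cloud side: multiplies the masked matrices only.
C_masked = A_masked @ B_masked          # equals D1 (A B) D3

# Client side: unmask locally in O(n^2) work.
C = np.diag(1 / d1) @ C_masked @ np.diag(1 / d3)
print("max error vs. A @ B:", np.abs(C - A @ B).max())
```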
7. An efficient key partitioning scheme for heterogeneous MapReduce clusters
Hadoop is a standard implementation of the MapReduce framework for running data-intensive applications on clusters of commodity servers. A thorough study of the framework shows that the shuffle phase, the all-to-all input-data-fetching phase of the reduce task, significantly affects application performance. In Hadoop’s MapReduce system there is variance both in the frequencies of the intermediate keys and in their distribution among data nodes across the cluster. This variance causes network overhead and leads to unfairness in the reduce input among the data nodes, so applications experience performance degradation in the shuffle phase. We develop a novel partitioning algorithm that, unlike previous systems, uses a node’s capabilities as heuristics to find a better available trade-off between locality and fairness in the system. Compared with Hadoop’s default partitioning algorithm and the Leen algorithm, our approach achieves average performance gains of 29% and 17%, respectively.
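The sketch below illustrates the core idea of capacity-aware key partitioning only: keys are assigned, heaviest first, to the node whose resulting load relative to its capability is lowest. The paper's actual heuristics, which also weigh data locality, are more involved than this greedy assignment.

```python
# Greedy capacity-weighted key assignment; an illustration of the idea,
# not the paper's algorithm (which also accounts for locality).
def partition(key_freqs, capabilities):
    """key_freqs: {key: record_count}; capabilities: per-node speed weights."""
    loads = [0.0] * len(capabilities)
    assignment = {}
    for key, freq in sorted(key_freqs.items(), key=lambda kv: -kv[1]):
        # Pick the node that stays most underloaded relative to capability.
        node = min(range(len(loads)),
                   key=lambda i: (loads[i] + freq) / capabilities[i])
        assignment[key] = node
        loads[node] += freq
    return assignment, loads

freqs = {"k1": 900, "k2": 500, "k3": 400, "k4": 100, "k5": 100}
assign, loads = partition(freqs, capabilities=[2.0, 1.0, 1.0])  # node 0 is 2x faster
print(assign, loads)
```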
8. An improved HDFS for small file
Hadoop is an open-source distributed computing platform, and HDFS is the Hadoop distributed file system. HDFS has a powerful data storage capacity and is therefore suitable for cloud storage systems. However, because HDFS was originally developed for streaming access to large files, its storage efficiency is low for massive numbers of small files. To solve this problem, the HDFS file storage process is improved: files are examined before being uploaded to the HDFS cluster, and if a file is small, it is merged and its index information is stored in an index file in the form of key-value pairs. Simulation shows that the improved HDFS has lower NameNode memory consumption than both the original HDFS and Hadoop Archives (HAR files), and can thus improve access efficiency.
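The merge-plus-index idea can be sketched on a local filesystem as follows: small files are appended into one container blob, and an index maps each file name to its (offset, length), much like the key-value index file described above. A real implementation would write the container and index into HDFS via a client library; paths and formats here are illustrative.

```python
# Local-filesystem sketch of small-file merging with a key-value index;
# file names and formats are illustrative.
import json, os

def merge(small_files, container="container.bin", index="index.json"):
    idx, offset = {}, 0
    with open(container, "wb") as out:
        for path in small_files:
            data = open(path, "rb").read()
            # Record where each small file lives inside the container.
            idx[os.path.basename(path)] = {"offset": offset, "length": len(data)}
            out.write(data)
            offset += len(data)
    json.dump(idx, open(index, "w"))

def read(name, container="container.bin", index="index.json"):
    entry = json.load(open(index))[name]
    with open(container, "rb") as f:
        f.seek(entry["offset"])          # one seek instead of one file open
        return f.read(entry["length"])
```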
9. Assessing Big Data SQL Frameworks for Analyzing Event Logs
Performing Process Mining by analyzing event logs generated by various systems is a computation- and I/O-intensive task. Distributed computing and Big Data processing frameworks make it possible to distribute all kinds of computation tasks across multiple computers instead of performing the whole task on a single computer. This paper assesses whether contemporary structured query language (SQL) supporting Big Data processing frameworks are mature enough to be used efficiently to distribute the computation of two central Process Mining tasks across two dissimilar clusters of computers providing BPM as a service in the cloud. Tests are performed using a novel automatic testing framework detailed in this paper and its supporting materials. As a result, an assessment is made of how well the selected Big Data processing frameworks manage to process and parallelize the analysis work required by Process Mining tasks.
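To show what such a workload looks like in SQL, here is a Spark SQL sketch of one central Process Mining computation: counting the "directly-follows" relation between activities in an event log. The input path and schema (a case_id, an activity, and a per-case sequence number seq) are assumptions, and the paper evaluates several SQL-on-Big-Data engines on tasks of this kind rather than this exact query.

```python
# Spark SQL sketch of a directly-follows count; schema and path assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pm-sql-sketch").getOrCreate()
spark.read.csv("hdfs:///logs/events.csv", header=True, inferSchema=True) \
     .createOrReplaceTempView("events")

df = spark.sql("""
  SELECT a.activity AS source, b.activity AS target, COUNT(*) AS freq
  FROM events a JOIN events b
    ON a.case_id = b.case_id AND b.seq = a.seq + 1  -- consecutive events
  GROUP BY a.activity, b.activity
  ORDER BY freq DESC
""")
df.show()
```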
10. Big Data Analytics in Mobile Cellular Networks
Mobile cellular networks have become both generators and carriers of massive data. Big data analytics can improve the performance of mobile cellular networks and maximize the revenue of operators. In this paper, we introduce a unified data model based on random matrix theory and machine learning. We then present an architectural framework for applying big data analytics in mobile cellular networks. Moreover, we describe several illustrative examples, including big signaling data, big traffic data, big location data, big radio waveform data, and big heterogeneous data in mobile cellular networks. Finally, we discuss a number of open research challenges of big data analytics in mobile cellular networks.
11. Data analysis for chronic disease - diabetes using map reduce technique
Chronic diseases persist for long periods of time; they can only be controlled and cannot be cured completely. A large proportion of people in the world are affected by chronic diseases, and in countries such as the U.S. most deaths are due to them. Examples of chronic conditions include allergy, cancer, asthma, heart disease, glaucoma, obesity, and viral diseases such as hepatitis C and HIV/AIDS. Among these, diabetes is one of the most hazardous: it means that blood glucose (blood sugar) is too high. It is categorized into two divisions: type 1 and type 2 diabetes. In type 1, the human body does not make insulin, so people with type 1 need to take insulin every day. In type 2, one of the most common forms, the glucose level in the blood is very high, and patients need physical activity and a proper diet. The analysis of the data is performed using the big data analytics framework Hadoop, which is used to process large data sets, and the analysis is done using the MapReduce algorithm.
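A minimal MapReduce job of this kind can be written as a Hadoop Streaming style mapper/reducer pair in Python, here counting patient records per diabetes category. The input format (CSV with a category field in the third column) is an assumption; the pair runs under hadoop-streaming.jar or can be tested locally with a shell pipe.

```python
# Hadoop Streaming style mapper/reducer (assumed CSV schema). Local test:
#   cat records.csv | python mr.py map | sort | python mr.py reduce
import sys

def mapper():
    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) > 2:
            print(f"{fields[2]}\t1")        # emit (category, 1)

def reducer():
    current, count = None, 0
    for line in sys.stdin:                  # input arrives sorted by key
        key, value = line.strip().split("\t")
        if key != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = key, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```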
12. Data and Energy Integrated Communication Networks for Wireless Big Data
This paper describes a new type of communication network called data and energy integrated communication networks (DEINs), which integrates two traditionally separate processes, wireless information transfer (WIT) and wireless energy transfer (WET), fulfilling co-transmission of data and energy. In particular, the energy transmission over radio frequency is for the purpose of energy harvesting (EH) rather than information decoding. One driving force behind the advent of DEINs is wireless big data, which comes from wireless sensors that produce a large number of small pieces of data. These sensors are typically powered by batteries that sooner or later drain and must be taken out and replaced or recharged. EH has emerged as a technology to wirelessly charge batteries in a contactless way. Recent research has attempted to combine WET with WIT, typically under the label of simultaneous wireless information and power transfer. Such work in the literature largely focuses on the communication side of the whole wireless network, with particular emphasis on power allocation. The DEIN communication network proposed in this paper regards the convergence of WIT and WET as a full system that considers not only the physical layer but also the higher layers, such as media access control and information routing. After describing the DEIN concept and its high-level architecture/protocol stack, this paper presents two use cases focusing on the lower layer and the higher layer of a DEIN network, respectively. The lower-layer use case is a fair resource allocation algorithm, whereas the higher-layer section introduces an efficient data forwarding scheme combined with EH. The two case studies aim to give a better explanation of the DEIN concept. Some future research directions and challenges are also pointed out.
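Since the paper's exact lower-layer formulation is not reproduced here, the following sketch illustrates only the generic notion of fair resource allocation, using classic max-min fairness via progressive filling over capped demands.

```python
# Generic max-min fair allocation (progressive filling); an illustration
# of fairness as a concept, not the paper's DEIN algorithm.
def max_min_fair(capacity, demands):
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)     # equal share of what is left
        for i in list(active):
            give = min(share, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] >= demands[i] - 1e-12:
                active.remove(i)            # satisfied users drop out
    return alloc

print(max_min_fair(10.0, [2.0, 2.6, 4.0, 5.0]))  # -> [2.0, 2.6, 2.7, 2.7]
```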
13. Defining Human Behaviors Using Big Data Analytics in Social Internet of Things
As we delve into the Internet of Things (IoT), we are witnessing intensive interaction and heterogeneous communication among different devices over the Internet, and these devices generate a massive volume of Big Data. The potential of these data has been analyzed with complex network theory in a specialized branch known as ‘Human Dynamics.’ In this extension, the goal is to describe human behavior in the social area in real time, an objective that is becoming practicable through the quantity of data provided by smartphones, social networks, and smart cities. These make the environment more intelligent and offer an intelligent space for sensing our activities and actions as well as the evolution of the ecosystem. To address these needs, this paper presents the concept of ‘defining human behavior’ using Big Data in the Social Internet of Things (SIoT) by proposing a system architecture that processes and analyzes big data in real time. The proposed architecture consists of three operational domains: object, SIoT server, and application. Data from the object domain is aggregated in the SIoT server domain, where it is efficiently stored and processed and intelligently responds to outer stimuli. The proposed system architecture focuses on analyzing the ecosystem provided by smart cities, wearable devices (e.g., body area networks), and Big Data to determine human behaviors as well as human dynamics. Furthermore, the feasibility and efficiency of the proposed system are evaluated on a Hadoop single-node setup on an Ubuntu 14.04 LTS machine with a 3.2 GHz Core i5 processor and 4 GB of memory.