.NET-IEEE PROJECTS
1. Dynamic and Public Auditing with Fair Arbitration for Cloud Data
ABSTRACT
Cloud users no longer physically possess their data, so ensuring the integrity of outsourced data becomes a challenging task. Recently proposed schemes such as “provable data possession” and “proofs of retrievability” are designed to address this problem, but they target static archive data and therefore lack support for data dynamics. Moreover, threat models in these schemes usually assume an honest data owner and focus on detecting a dishonest cloud service provider, despite the fact that clients may also misbehave. This paper proposes a public auditing scheme with support for data dynamics and fair arbitration of potential disputes. In particular, we design an index switcher to eliminate the limitation that current schemes place on index usage in tag computation, and thereby achieve efficient handling of data dynamics. To address the fairness problem, so that no party can misbehave without being detected, we further extend existing threat models and adopt the signature-exchange idea to design fair arbitration protocols, so that any possible dispute can be settled fairly. The security analysis shows our scheme is provably secure, and the performance evaluation demonstrates that the overheads of data dynamics and dispute arbitration are reasonable.
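To make the index-switcher idea concrete, here is a minimal Python sketch (an illustration, not the paper's actual construction): block tags are bound to stable tag indices rather than block positions, and a position-to-tag-index map absorbs insertions and deletions, so a dynamic update never forces re-tagging of untouched blocks. The HMAC-based tag and the owner key are assumptions made for the sketch.

```python
import hmac, hashlib

SECRET = b"owner-tag-key"  # hypothetical owner key

def tag(block: bytes, tag_index: int) -> bytes:
    # The tag binds the block to a stable tag index, not to its position.
    return hmac.new(SECRET, tag_index.to_bytes(8, "big") + block, hashlib.sha256).digest()

class IndexSwitcher:
    """Maps logical block positions to stable tag indices."""
    def __init__(self, blocks):
        self.next_tag = 0
        self.map = []            # position -> tag index
        self.tags = {}           # tag index -> tag
        for b in blocks:
            self.insert(len(self.map), b)

    def insert(self, pos: int, block: bytes):
        t = self.next_tag; self.next_tag += 1
        self.map.insert(pos, t)
        self.tags[t] = tag(block, t)   # only the new block is tagged

    def delete(self, pos: int):
        t = self.map.pop(pos)
        del self.tags[t]               # no other tag changes

    def verify(self, pos: int, block: bytes) -> bool:
        t = self.map[pos]
        return hmac.compare_digest(self.tags[t], tag(block, t))

idx = IndexSwitcher([b"b0", b"b1", b"b2"])
idx.insert(1, b"new")                  # dynamic update: b1 and b2 keep their tags
print(idx.verify(1, b"new"), idx.verify(2, b"b1"))  # True True
```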
2. Enabling Cloud Storage Auditing with Verifiable Outsourcing of Key Updates
ABSTRACT
Key-exposure resistance has always been an important issue for in-depth cyber defence in many security applications. Recently, how to deal with the key exposure problem in the setting of cloud storage auditing has been proposed and studied. To address the challenge, existing solutions all require the client to update his secret keys in every time period, which inevitably brings new local burdens to the client, especially those with limited computation resources, such as mobile phones. In this paper, we focus on how to make the key updates as transparent as possible for the client and propose a new paradigm called cloud storage auditing with verifiable outsourcing of key updates. In this paradigm, key updates can be safely outsourced to some authorized party, and thus the key-update burden on the client will be kept minimal. In particular, we leverage the third party auditor (TPA) present in many existing public auditing designs, let it play the role of the authorized party in our case, and make it in charge of both the storage auditing and the secure key updates for key-exposure resistance. In our design, the TPA only needs to hold an encrypted version of the client’s secret key while doing all these burdensome tasks on behalf of the client. The client only needs to download the encrypted secret key from the TPA when uploading new files to the cloud. Besides, our design also equips the client with the capability to further verify the validity of the encrypted secret keys provided by the TPA. All these salient features are carefully designed to make the whole auditing procedure with key-exposure resistance as transparent as possible for the client. We formalize the definition and the security model of this paradigm. The security proof and the performance simulation show that our detailed design instantiations are secure and efficient.
3. Providing User Security Guarantees in Public Infrastructure Clouds
ABSTRACT
The infrastructure cloud (IaaS) service model offers improved resource flexibility and availability, where tenants – insulated from the minutiae of hardware maintenance – rent computing resources to deploy and operate complex systems. Large-scale services running on IaaS platforms demonstrate the viability of this model; nevertheless, many organizations operating on sensitive data avoid migrating operations to IaaS platforms due to security concerns. In this paper, we describe a framework for data and operation security in IaaS, consisting of protocols for a trusted launch of virtual machines and domain-based storage protection. We continue with an extensive theoretical analysis with proofs about protocol resistance against attacks in the defined threat model. The protocols allow trust to be established by remotely attesting host platform configuration prior to launching guest virtual machines and ensure confidentiality of data in remote storage, with encryption keys maintained outside of the IaaS domain. Presented experimental results demonstrate the validity and efficiency of the proposed protocols. The framework prototype was implemented on a test bed operating a public electronic health record system, showing that the proposed protocols can be integrated into existing cloud environments.
4. Service Usage Classification with Encrypted Internet Traffic in Mobile Messaging Apps
ABSTRACT
The rapid adoption of mobile messaging Apps has enabled us to collect massive amounts of encrypted Internet traffic of mobile messaging. The classification of this traffic into different types of in-App service usages can support intelligent network management, such as managing network bandwidth budgets and providing quality of service. Traditional approaches for classifying Internet traffic rely on packet inspection, such as parsing HTTP headers. However, messaging Apps are increasingly using secure protocols, such as HTTPS and SSL, to transmit data. This imposes significant challenges on the performance of service usage classification by packet inspection. To this end, in this paper, we investigate how to exploit encrypted Internet traffic for classifying in-App usages. Specifically, we develop a system, named CUMMA, for classifying service usages of mobile messaging Apps by jointly modeling user behavioral patterns, network traffic characteristics and temporal dependencies. Along this line, we first hierarchically segment Internet traffic-flows into sessions consisting of a number of dialogs. Also, we extract the discriminative features of traffic data from two perspectives: (i) packet length and (ii) time delay. Next, we learn a service usage predictor to classify these segmented dialogs into single-type usages or outliers. In addition, we design a clustering Hidden Markov Model (HMM) based method to detect mixed dialogs among the outliers and decompose mixed dialogs into sub-dialogs of single-type usage. Indeed, CUMMA enables mobile analysts to identify service usages and analyze end-user in-App behaviors even for encrypted Internet traffic. Finally, extensive experiments on real-world messaging data demonstrate the effectiveness and efficiency of the proposed method for service usage classification.
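As a hedged illustration of this kind of pipeline (not the CUMMA system itself), the Python sketch below extracts the two feature families the abstract names, packet length and time delay, from a segmented dialog and classifies it with a simple nearest-centroid predictor that routes far-from-all-centroids dialogs to an outlier class, which in the paper would be handled by the clustering-HMM stage. The dialog format and the rejection threshold are assumptions.

```python
import numpy as np

# Hypothetical dialog format: list of (timestamp_seconds, packet_length) pairs.
def features(dialog):
    ts = np.array([t for t, _ in dialog], dtype=float)
    lens = np.array([l for _, l in dialog], dtype=float)
    gaps = np.diff(ts) if len(ts) > 1 else np.array([0.0])
    return np.array([lens.mean(), lens.std(), gaps.mean(), gaps.std()])

class NearestCentroid:
    def fit(self, X, y):
        self.centroids = {c: X[[i for i, t in enumerate(y) if t == c]].mean(axis=0)
                          for c in set(y)}
        return self
    def predict(self, x, reject=400.0):
        dists = {c: np.linalg.norm(x - m) for c, m in self.centroids.items()}
        best = min(dists, key=dists.get)
        # Dialogs far from every single-type centroid are flagged as outliers
        # (candidates for mixed-usage decomposition).
        return best if dists[best] < reject else "outlier"

train = [([(0.0, 120), (0.1, 130), (0.3, 125)], "text"),
         ([(0.0, 1400), (0.02, 1400), (0.05, 1380)], "photo")]
X = np.array([features(d) for d, _ in train]); y = [c for _, c in train]
clf = NearestCentroid().fit(X, y)
print(clf.predict(features([(0.0, 1350), (0.03, 1420)])))  # photo
```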
5. Text Mining the Contributors to Rail Accidents
ABSTRACT
Rail accidents represent an important safety concern for the transportation industry in many countries. In the 11 years from 2001 to 2012, the U.S. had more than 40,000 rail accidents that cost more than $45 million. While most of the accidents during this period had very little cost, about 5,200 had damages in excess of $141,500. To better understand the contributors to these extreme accidents, the Federal Railroad Administration has required the railroads involved in accidents to submit reports that contain both fixed field entries and narratives that describe the characteristics of the accident. While a number of studies have looked at the fixed fields, none have done an extensive analysis of the narratives. This paper describes the use of text mining with a combination of techniques to automatically discover accident characteristics that can inform a better understanding of the contributors to the accidents. The study evaluates the efficacy of text mining of accident narratives by assessing predictive performance for the costs of extreme accidents. The results show that predictive accuracy for accident costs significantly improves through the use of features found by text mining and predictive accuracy further improves through the use of modern ensemble methods. Importantly, this study also shows through case examples how the findings from text mining of the narratives can improve understanding of the contributors to rail accidents in ways not possible through only fixed field analysis of the accident reports.
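A minimal sketch of this kind of narrative-feature pipeline, assuming scikit-learn is available: TF-IDF features extracted from accident narratives feed an ensemble regressor that predicts (log) accident cost. The narratives and cost values below are invented placeholders, not data from the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy narratives and costs, standing in for FRA accident reports.
narratives = ["track buckled under extreme heat causing derailment",
              "conductor failed to observe signal at grade crossing",
              "brake failure on descending grade led to collision"]
log_costs = np.log([250000.0, 90000.0, 400000.0])

vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
X = vec.fit_transform(narratives).toarray()   # text-mined features

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, log_costs)                       # ensemble method over text features

query = vec.transform(["rail buckled in extreme heat"]).toarray()
print(np.exp(model.predict(query)))           # predicted accident cost
```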
6. MMBcloud-tree: Authenticated Index for Verifiable Cloud Service Selection
ABSTRACT
Cloud brokers have been recently introduced as an additional computational layer to facilitate cloud selection and service management tasks for cloud consumers. However, existing brokerage schemes on cloud service selection typically assume that brokers are completely trusted, and do not provide any guarantee over the correctness of the service recommendations. It is then possible for a compromised or dishonest broker to easily take advantage of the limited capabilities of the clients and provide incorrect or incomplete responses. To address this problem, we propose an innovative Cloud Service Selection Verification (CSSV) scheme and index structures (MMBcloud-tree) to enable cloud clients to detect misbehavior of the cloud brokers during the service selection process. We demonstrate correctness and efficiency of our approaches both theoretically and empirically.
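The abstract does not spell out the MMBcloud-tree internals, so the sketch below shows, as a hedged stand-in, the general authenticated-index pattern such schemes build on: the provider publishes a Merkle root over its service records, and the broker must return a verification path with every recommendation, letting the client detect tampered or substituted answers.

```python
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def build(leaves):
    level = [h(b"leaf:" + l) for l in leaves]
    tree = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]        # duplicate last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def proof(tree, i):
    path = []
    for level in tree[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = i ^ 1
        path.append((level[sib], sib < i))     # (sibling hash, sibling-is-left)
        i //= 2
    return path

def verify(root, leaf, path):
    node = h(b"leaf:" + leaf)
    for sib, sib_is_left in path:
        node = h(sib + node) if sib_is_left else h(node + sib)
    return node == root

services = [b"svc-A:9ms", b"svc-B:12ms", b"svc-C:7ms", b"svc-D:20ms"]
tree = build(services)
root = tree[-1][0]                       # published by the service provider
p = proof(tree, 2)                       # broker returns svc-C plus this proof
print(verify(root, b"svc-C:7ms", p))     # client checks: True
```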
7. Identity-Based Proxy-Oriented Data Uploading and Remote Data Integrity Checking in Public Cloud
ABSTRACT
More and more clients would like to store their data in public cloud servers (PCSs) as cloud computing develops rapidly. New security problems have to be solved in order to help more clients process their data in the public cloud. When the client is restricted from accessing the PCS, it delegates its proxy to process its data and upload them. On the other hand, remote data integrity checking is also an important security problem in public cloud storage: it lets clients check whether their outsourced data are kept intact without downloading the whole data. Motivated by these security problems, we propose a novel proxy-oriented data uploading and remote data integrity checking model in identity-based public key cryptography: identity-based proxy-oriented data uploading and remote data integrity checking in public cloud (ID-PUIC). We give the formal definition, system model, and security model. Then, a concrete ID-PUIC protocol is designed using bilinear pairings. The proposed ID-PUIC protocol is provably secure based on the hardness of the computational Diffie-Hellman problem. Our ID-PUIC protocol is also efficient and flexible. Based on the original client’s authorization, the proposed ID-PUIC protocol can realize private remote data integrity checking, delegated remote data integrity checking, and public remote data integrity checking.
8. Fine-grained Two-factor Access Control for Web-based Cloud Computing Services
ABSTRACT
In this paper, we introduce a new fine-grained two-factor authentication (2FA) access control system for web-based cloud computing services. Specifically, in our proposed 2FA access control system, an attribute-based access control mechanism is implemented with the necessity of both a user secret key and a lightweight security device. As a user cannot access the system without holding both, the mechanism can enhance the security of the system, especially in scenarios where many users share the same computer for web-based cloud services. In addition, attribute-based control in the system also enables the cloud server to restrict access to users with the same set of attributes while preserving user privacy, i.e., the cloud server only knows that the user fulfills the required predicate, but has no idea of the user's exact identity. Finally, we also carry out a simulation to demonstrate the practicability of our proposed 2FA system.
9. Cloud workflow scheduling with deadlines and time slot availability
ABSTRACT
Allocating service capacities in cloud computing is often based on the assumption that they are unlimited and can be used at any time. However, available service capacities change with workload and, from the cloud provider’s perspective, cannot satisfy users’ requests at all times, because cloud services can be shared by multiple tasks. Cloud service providers offer available time slots for new users’ requests based on available capacities. In this paper, we consider workflow scheduling with deadlines and time slot availability in cloud computing. An iterated heuristic framework is presented for the problem under study, which mainly consists of initial solution construction, improvement, and perturbation. Three initial-solution construction strategies, two greedy- and fair-based improvement strategies and a perturbation strategy are proposed. Different strategies in the three phases result in several heuristics. Experimental results show that different initial solution and improvement strategies have different effects on solution quality.
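A minimal Python skeleton of the iterated heuristic framework described above (construction, improvement, perturbation), instantiated on a deliberately tiny stand-in problem, balancing task durations across two machines, rather than the paper's deadline-and-time-slot workflow model:

```python
import random

def iterated_heuristic(construct, improve, perturb, cost, iters=100, seed=0):
    """Generic construct / improve / perturb loop (iterated local search)."""
    rng = random.Random(seed)
    best = improve(construct(rng))
    for _ in range(iters):
        cand = improve(perturb(best, rng))
        if cost(cand) < cost(best):          # keep the better schedule
            best = cand
    return best

durations = [4, 2, 7, 3, 5]                  # toy tasks, two machines
def construct(rng):                          # random initial assignment
    return [rng.randrange(2) for _ in durations]
def cost(a):                                 # makespan
    return max(sum(d for d, m in zip(durations, a) if m == k) for k in (0, 1))
def improve(a):                              # greedy single-task moves
    a = a[:]
    for i in range(len(a)):
        for m in (0, 1):
            b = a[:]; b[i] = m
            if cost(b) < cost(a):
                a = b
    return a
def perturb(a, rng):                         # flip one random assignment
    b = a[:]; b[rng.randrange(len(b))] ^= 1
    return b

print(iterated_heuristic(construct, improve, perturb, cost))  # a low-makespan assignment
```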
10. Publicly Verifiable Inner Product Evaluation over Outsourced Data Streams under Multiple Keys
ABSTRACT
Uploading data streams to a resource-rich cloud server for inner product evaluation, an essential building block in many popular stream applications (e.g., statistical monitoring), is appealing to many companies and individuals. On the other hand, verifying the result of the remote computation plays a crucial role in addressing the issue of trust. Since the outsourced data collection likely comes from multiple data sources, it is desirable for the system to be able to pinpoint the originator of errors by allotting each data source a unique secret key, which requires the inner product verification to be performed under any two parties’ different keys. However, present solutions either depend on a single-key assumption or on powerful yet practically inefficient fully homomorphic cryptosystems. In this paper, we focus on the more challenging multi-key scenario where data streams are uploaded by multiple data sources with distinct keys. We first present a novel homomorphic verifiable tag technique to publicly verify the outsourced inner product computation on dynamic data streams, and then extend it to support the verification of matrix product computation. We prove the security of our scheme in the random oracle model. Moreover, the experimental results also show the practicability of our design.
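The paper's homomorphic verifiable tags are not reproduced here; as a hedged illustration of why verification can be far cheaper than recomputation, the sketch below uses the classic Freivalds test, which checks an outsourced n×n matrix product in O(n²) time per round instead of recomputing it in O(n³).

```python
import random

def freivalds(A, B, C, rounds=20):
    """Probabilistically check whether A @ B == C (classic Freivalds' test)."""
    n = len(C)
    for _ in range(rounds):
        r = [random.randrange(2) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False          # caught a wrong result
    return True                   # wrong with probability <= 2**-rounds

A = [[1, 2], [3, 4]]; B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]          # correct product
print(freivalds(A, B, C))         # True
C[0][0] += 1                      # cloud returns a tampered result
print(freivalds(A, B, C))         # almost surely False
```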
11. Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
ABSTRACT
With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which explores both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems in spatial keyword queries: top-k spatial keyword search (TOPK-SK) and batch top-k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location and a set of query keywords, TOPK-SK retrieves the closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.
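A toy sketch of the linear-quadtree side of the index: cells are linearized by interleaving coordinate bits into Z-order (Morton) keys, and a keyword filter plus distance ranking answers TOPK-SK over one sorted key list. The real IL-Quadtree additionally prunes with inverted lists and signatures; the objects, coordinates, and Manhattan metric here are assumptions for illustration.

```python
def morton(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of (x, y) into a Z-order key, as linear quadtrees do."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return code

# Hypothetical spatio-textual objects: (x, y, keyword set)
objects = [(3, 7, {"cafe", "wifi"}), (12, 2, {"cafe"}), (5, 6, {"museum"})]
index = sorted((morton(x, y), x, y, kw) for x, y, kw in objects)

def topk(qx, qy, qkw, k=1):
    """Naive TOPK-SK over the linear index: keyword filter, then rank by distance."""
    hits = [(abs(x - qx) + abs(y - qy), x, y)
            for _, x, y, kw in index if qkw <= kw]    # object must contain all keywords
    return sorted(hits)[:k]

print(topk(4, 6, {"cafe"}))   # nearest object containing all query keywords
```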
12. Securing SIFT: Privacy-preserving Outsourcing Computation of Feature Extractions over Encrypted Image Data
ABSTRACT
Advances in cloud computing have greatly motivated data owners to outsource their huge amounts of personal multimedia data and/or computationally expensive tasks to the cloud, leveraging its abundant resources for cost saving and flexibility. Despite the tremendous benefits, the outsourced multimedia data and the applications built on it may reveal the data owner’s private information, such as personal identity, locations or even financial profiles. This observation has recently aroused new research interest in privacy-preserving computations over outsourced multimedia data. In this paper, we propose an effective and practical privacy-preserving computation outsourcing protocol for the prevailing scale-invariant feature transform (SIFT) over massive encrypted image data. We first show that previous solutions to this problem have either efficiency/security or practicality issues, and none can well preserve the important characteristics of the original SIFT in terms of distinctiveness and robustness. We then present a new scheme design that achieves the efficiency and security requirements simultaneously while preserving SIFT’s key characteristics, by randomly splitting the original image data, designing two novel efficient protocols for secure multiplication and comparison, and carefully distributing the feature extraction computations onto two independent cloud servers. We carefully analyze and extensively evaluate the security and effectiveness of our design. The results show that our solution is practically secure, outperforms the state-of-the-art, and performs comparably to the original SIFT in terms of various characteristics, including rotation invariance, image scale invariance, robust matching across affine distortion, addition of noise, and change in 3D viewpoint and illumination.
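The paper's own multiplication and comparison protocols are not reproduced here; as a hedged sketch of the underlying principle, computing on randomly split data held by two non-colluding servers, the following shows additive secret sharing with a dealer-generated Beaver triple, the textbook way to multiply two secrets without either server learning them.

```python
import random

P = 2**61 - 1  # prime modulus for additive sharing

def share(x):
    r = random.randrange(P)
    return r, (x - r) % P          # two random-looking shares, one per server

def beaver_mul(x0, x1, y0, y1):
    """Multiply secret-shared x and y using a dealer triple (a, b, c = a*b)."""
    a, b = random.randrange(P), random.randrange(P)
    c = a * b % P
    a0, a1 = share(a); b0, b1 = share(b); c0, c1 = share(c)
    # The servers jointly reconstruct e = x - a and f = y - b; these masked
    # values are safe to publish because a and b are uniformly random.
    e = (x0 - a0 + x1 - a1) % P
    f = (y0 - b0 + y1 - b1) % P
    z0 = (e * f + e * b0 + f * a0 + c0) % P   # server 0's share of x*y
    z1 = (e * b1 + f * a1 + c1) % P           # server 1's share of x*y
    return z0, z1

x0, x1 = share(6); y0, y1 = share(7)          # e.g. two pixel values
z0, z1 = beaver_mul(x0, x1, y0, y1)
print((z0 + z1) % P)   # 42, without either server seeing 6 or 7
```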
13. A Secure and Dynamic Multi-keyword Ranked Search Scheme over Encrypted Cloud Data
ABSTRACT
Due to the increasing popularity of cloud computing, more and more data owners are motivated to outsource their data to cloud servers for great convenience and reduced cost in data management. However, sensitive data should be encrypted before outsourcing for privacy reasons, which makes traditional data utilization such as keyword-based document retrieval obsolete. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data which simultaneously supports dynamic update operations such as deletion and insertion of documents. Specifically, the vector space model and the widely used TF-IDF model are combined in index construction and query generation. We construct a special tree-based index structure and propose a “Greedy Depth-first Search” algorithm to provide efficient multi-keyword ranked search. The secure kNN algorithm is utilized to encrypt the index and query vectors, and meanwhile ensure accurate relevance score calculation between encrypted index and query vectors. In order to resist statistical attacks, phantom terms are added to the index vector to blind search results. Due to the use of our special tree-based index structure, the proposed scheme can achieve sub-linear search time and deal with the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.
14. Protecting Your Right: Verifiable Attribute-based Keyword Search with Fine-grained Owner-enforced Search Authorization in the Cloud
ABSTRACT
Search over encrypted data is a critically important enabling technique in cloud computing, where encryption-before-outsourcing is a fundamental solution to protecting user data privacy in the untrusted cloud server environment. Many secure search schemes have focused on the single-contributor scenario, where the outsourced dataset or the secure searchable index of the dataset is encrypted and managed by a single owner, typically based on symmetric cryptography. In this paper, we focus on a different yet more challenging scenario where the outsourced dataset can be contributed by multiple owners and is searchable by multiple users, i.e., the multi-user multi-contributor case. Inspired by attribute-based encryption (ABE), we present the first attribute-based keyword search scheme with efficient user revocation (ABKS-UR) that enables scalable fine-grained (i.e., file-level) search authorization. Our scheme allows multiple owners to encrypt and outsource their data to the cloud server independently. Users can generate their own search capabilities without relying on an always-online trusted authority. Fine-grained search authorization is also implemented by the owner-enforced access policy on the index of each file. Further, by incorporating proxy re-encryption and lazy re-encryption techniques, we are able to delegate the heavy system update workload during user revocation to the resourceful, semi-trusted cloud server. We formalize the security definition and prove the proposed ABKS-UR scheme selectively secure against chosen-keyword attack. To build data users’ confidence in the proposed secure search system, we also design a search result verification scheme. Finally, performance evaluation shows the efficiency of our scheme.
15. Secure Data Analytics for Cloud-Integrated Internet of Things Applications
ABSTRACT
Cloud-integrated Internet of Things (IoT) is emerging as the next-generation service platform that enables smart functionality worldwide. IoT applications such as smart grid and power systems, e-health, and body monitoring applications along with large-scale environmental and industrial monitoring are increasingly generating large amounts of data that can conveniently be analyzed through cloud service provisioning. However, the nature of these applications mandates the use of secure and privacy-preserving implementation of services that ensures the integrity of data without any unwarranted exposure. This article explores the unique challenges and issues within this context of enabling secure cloud-based data analytics for the IoT. Three main applications are discussed in detail, with solutions outlined based on the use of fully homomorphic encryption systems to achieve data security and privacy over cloud-based analytical phases. The limitations of existing technologies are discussed and models proposed with regard to achieving high efficiency and accuracy in the provisioning of analytic services for encrypted data over a cloud platform.
16. A Low-Cost Low-Power Ring Oscillator-based Truly Random Number Generator for Encryption on Smart Cards
ABSTRACT
The design of a low-cost low-power ring oscillator-based truly random number generator (TRNG) macro-cell, suitable to be integrated in smart cards, is presented. The oscillator sampling technique is exploited, and a tetrahedral oscillator with large jitter has been employed to realize the TRNG. Techniques to improve the statistical quality of the ring oscillator-based TRNG’s bit sequences have been presented and verified by simulation and measurement. A post-digital processor is added to further enhance the randomness of the output bits. Fabricated in the HHNEC 0.13 µm standard CMOS process, the proposed TRNG has an area as low as 0.005 mm². Powered by a single 1.8 V supply voltage, the TRNG has a power consumption of 40 µW. The bit rate of the TRNG after post-processing is 100 kb/s. The proposed TRNG has been made into an IP and successfully applied in an SD card for encryption applications. The proposed TRNG has passed the NIST tests and Diehard tests.
17. Encrypted Data Management with Deduplication in Cloud Computing
ABSTRACT
Cloud computing offers a new way to deliver services by rearranging resources over the Internet and providing them to users on demand. It plays an important role in supporting data storage, processing, and management in the Internet of Things (IoT). Various cloud service providers (CSPs) offer huge volumes of storage to maintain and manage IoT data, which can include videos, photos, and personal health records. These CSPs provide desirable service properties, such as scalability, elasticity, fault tolerance, and pay per use. Thus, cloud computing has become a promising service paradigm to support IoT applications and IoT system deployment. To ensure data privacy, existing research proposes to outsource only encrypted data to CSPs. However, the same or different users could save duplicated data under different encryption schemes at the cloud. Although cloud storage space is huge, this kind of duplication wastes networking resources, consumes excess power, and complicates data management. Thus, saving storage is becoming a crucial task for CSPs. Deduplication can achieve high space and cost savings, reducing storage needs by up to 90 to 95 percent for backup applications (http://opendedup.org) and by up to 68 percent in standard file systems. Obviously, the savings, which can be passed back directly or indirectly to cloud users, are significant to the economics of the cloud business. At the same time, data owners want CSPs to protect their personal data from unauthorized access. CSPs should therefore perform access control based on the data owner’s expectations. In addition, data owners want to control not only data access but also its storage and usage. From a flexibility viewpoint, data deduplication should cooperate with data access control mechanisms. That is, the same data, although in an encrypted form, is only saved once at the cloud but can be accessed by different users based on the data owners’ policies.
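The standard building block that lets deduplication coexist with encryption is convergent encryption: the key is derived from the plaintext itself, so identical files encrypt to identical ciphertexts and the CSP can deduplicate without reading the data. A toy Python sketch (illustrative XOR stream construction, not production cryptography):

```python
import hashlib

def convergent_encrypt(plaintext: bytes):
    """Deterministic encryption with key = H(plaintext): identical files yield
    identical ciphertexts. Decryption is the same XOR with the same keystream."""
    key = hashlib.sha256(plaintext).digest()
    stream = b"".join(hashlib.sha256(key + i.to_bytes(8, "big")).digest()
                      for i in range(len(plaintext) // 32 + 1))
    ct = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ct

store = {}                                   # the CSP's deduplicated store
for owner, data in [("alice", b"shared video"), ("bob", b"shared video")]:
    key, ct = convergent_encrypt(data)       # each owner derives the same key
    store.setdefault(hashlib.sha256(ct).hexdigest(), ct)
print(len(store))                            # 1: the duplicate is stored only once
```

In a full design such as the one the article discusses, this is combined with access control so that only users authorized by the data owners can obtain the convergent key.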
18. Dual-Server Public-Key Encryption with Keyword Search for Secure Cloud Storage
ABSTRACT
Searchable encryption is of increasing interest for protecting the data privacy in secure searchable cloud storage. In this work, we investigate the security of a well-known cryptographic primitive, namely Public Key Encryption with Keyword Search (PEKS) which is very useful in many applications of cloud storage. Unfortunately, it has been shown that the traditional PEKS framework suffers from an inherent insecurity called inside Keyword Guessing Attack (KGA) launched by the malicious server. To address this security vulnerability, we propose a new PEKS framework named Dual-Server Public Key Encryption with Keyword Search (DS-PEKS). As another main contribution, we define a new variant of the Smooth Projective Hash Functions (SPHFs) referred to as linear and homomorphic SPHF (LH-SPHF). We then show a generic construction of secure DS-PEKS from LH-SPHF. To illustrate the feasibility of our new framework, we provide an efficient instantiation of the general framework from a DDH-based LH-SPHF and show that it can achieve the strong security against inside KGA.
19. A recommendation system based on hierarchical clustering of an article-level citation network
ABSTRACT
The scholarly literature is expanding at a rate that necessitates intelligent algorithms for search and navigation. For the most part, the problem of delivering scholarly articles has been solved. If one knows the title of an article, locating it requires little effort and, paywalls permitting, acquiring a digital copy has become trivial. However, the navigational aspect of scientific search – finding relevant, influential articles that one does not know exist – is in its early development. In this paper, we introduce Eigenfactor Recommends – a citation-based method for improving scholarly navigation. The algorithm uses the hierarchical structure of scientific knowledge, making possible multiple scales of relevance for different users. We implement the method and generate more than 300 million recommendations from more than 35 million articles from various bibliographic databases including the AMiner dataset. We find little overlap with co-citation, another well-known citation recommender, which indicates potential complementarity. In an online A-B comparison using SSRN, we find that our approach performs as well as co-citation, but this new approach offers much larger recommendation coverage. We make the code and recommendations freely available at babel.eigenfactor.org and provide an API for others to use for implementing and comparing the recommendations on their own platforms.
20. Efficient Group Key Transfer Protocol for WSNs
ABSTRACT
Special designs are needed for cryptographic schemes in wireless sensor networks (WSNs), because sensor nodes are limited in memory storage and computational power. The existing group key transfer protocols for WSNs using classical secret sharing require that a t-degree interpolating polynomial be computed in order to encrypt and decrypt the secret group key, which is too computationally intensive. In this paper, we propose a new group key transfer protocol using a linear secret sharing scheme (LSSS) and the factoring assumption. The proposed protocol can resist potential attacks and also significantly reduce the computational complexity of the system while maintaining low communication cost. Such a scheme is desirable for secure group communications in WSNs, where portable devices or sensors need to reduce their computation as much as possible due to battery power limitations.
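For contrast, here is the classical Shamir-style approach the paper improves on, in which distributing and recovering the group key requires evaluating and interpolating a t-degree polynomial, exactly the cost the LSSS construction is designed to avoid. A compact Python sketch:

```python
import random

P = 2**127 - 1   # Mersenne prime modulus

def split(secret, t, n):
    """Shamir: shares are points on a random degree-(t-1) polynomial."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the group key."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(123456789, t=3, n=5)
print(reconstruct(shares[:3]) == 123456789)   # any 3 of 5 shares suffice: True
```

The interpolation loop is quadratic in t with modular exponentiations for the inverses, which is the per-node cost a sensor must pay under the classical protocols.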
21. A modified hierarchical attribute-based encryption access control method for mobile cloud computing
ABSTRACT
Cloud computing is an Internet-based computing pattern through which shared resources are provided to devices on demand. It is an emerging but promising paradigm for integrating mobile devices into cloud computing, and the integration is performed in the cloud-based hierarchical multi-user data-shared environment. With this integration, security issues such as data confidentiality and user authority may arise in the mobile cloud computing system, and they are regarded as the main constraints on the development of mobile cloud computing. In order to provide safe and secure operation, a hierarchical access control method using modified hierarchical attribute-based encryption (M-HABE) and a modified three-layer structure is proposed in this paper. In a specific mobile cloud computing model, enormous data from all kinds of mobile devices, such as smart phones, feature phones and PDAs, can be controlled and monitored by the system; the data can be sensitive to unauthorized third parties and restricted to legal users as well. The novel scheme mainly focuses on data processing, storing and accessing, and is designed to ensure that users with legal authority obtain the corresponding classified data and to restrict illegal users and unauthorized legal users from accessing the data, which makes it extremely suitable for mobile cloud computing paradigms.
22. Using crowd sourcing to provide QoS for mobile cloud computing
ABSTRACT
Quality of cloud service (QoS) is one of the crucial factors for the success of cloud providers in mobile cloud computing. Context-awareness is a popular method for automatic awareness of the mobile environment and for choosing the most suitable cloud provider. Lack of context information may harm the users’ confidence in the application, rendering it useless. Thus, mobile devices need to be constantly aware of the environment and to test the performance of each cloud provider, which is inefficient and wastes energy. Crowd sourcing is a promising technology for discovering and selecting cloud services in order to provide intelligent, efficient, and stable discovery of services for mobile users based on group choice. This article introduces a crowd sourcing-based, QoS-supported mobile cloud service framework that ensures mobile users’ satisfaction by sensing their context information and providing appropriate services to each of the users. Based on the user’s activity context, social context, service context, and device context, our framework dynamically adapts cloud services for requests in different kinds of scenarios. The context-awareness-based management approach efficiently achieves a reliable cloud service platform that supplies quality of service on mobile devices.
23. Towards achieving data security with the cloud computing adoption framework
ABSTRACT
Offering real-time data security for petabytes of data is important for cloud computing. A recent survey on cloud security states that the security of users’ data has the highest priority as well as concern. We believe this can only be achieved with an approach that is systematic, adoptable and well-structured. Therefore, this paper has developed a framework known as the Cloud Computing Adoption Framework (CCAF), which has been customized for securing cloud data. This paper explains the overview, rationale and components of the CCAF for protecting data security. CCAF is illustrated by the system design based on the requirements and by the implementation demonstrated by the CCAF multi-layered security. Since our Data Centre has 10 petabytes of data, providing real-time protection and quarantine is a huge task. We use Business Process Modelling Notation (BPMN) to simulate how data is in use. The use of BPMN simulation allows us to evaluate the chosen security performance before actual implementation. Results show that the time to take control of a security breach can be between 50 and 125 hours. This means that additional security is required to ensure all data is well protected in the crucial 125 hours. This paper has also demonstrated that CCAF multi-layered security can protect data in real time, with three layers of security: 1) firewall and access control; 2) identity management and intrusion prevention; and 3) convergent encryption. To validate CCAF, this paper has undertaken two sets of ethical-hacking experiments involving penetration testing with 10,000 Trojans and viruses. The CCAF multi-layered security can block 9,919 viruses and Trojans, which can be destroyed in seconds, and the remaining ones can be quarantined or isolated. The experiments show that although the percentage blocked can decrease with continuous injection of viruses and Trojans, 97.43 percent of them can be quarantined. Our CCAF multi-layered security has on average 20 percent better performance than the single-layered approach, which could only block 7,438 viruses and Trojans. CCAF can be more effective when combined with BPMN simulation to evaluate the security process and penetration-testing results.
24. A combinatorial auction mechanism for multiple resource procurement in cloud computing
ABSTRACT
Multiple resource procurement from several cloud vendors participating in bidding is addressed in this paper. This is done by assigning dynamic pricing for these resources. Since we consider multiple resources to be procured from several cloud vendors bidding in an auction, the problem turns out to be one of a combinatorial auction. We pre-process the user requests, analyze the auction and declare a set of vendors bidding for the auction as winners based on the Combinatorial Auction Branch on Bids (CABOB) model. Simulations using our approach with prices procured from several cloud vendors’ datasets show its effectiveness in multiple resource procurement in the realm of cloud computing.
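CABOB itself adds sophisticated bounding and graph decompositions; the hedged sketch below shows only the core branch-on-bids idea for winner determination: each bid is either accepted (removing its bundle from the remaining resources) or rejected, with an optimistic upper bound used to prune the search. Bids and resources are invented examples.

```python
def winner_determination(bids, items):
    """Simplified branch-on-bids search for the winner determination problem.
    bids: list of (bundle frozenset, price). Returns (best_value, chosen indices)."""
    best = [0, []]
    def ub(i, remaining):
        # Optimistic bound: sum of all remaining bids still fitting the resources.
        return sum(p for b, p in bids[i:] if b <= remaining)
    def search(i, remaining, value, chosen):
        if value > best[0]:
            best[0], best[1] = value, chosen[:]
        if i == len(bids) or value + ub(i, remaining) <= best[0]:
            return                                    # prune this branch
        bundle, price = bids[i]
        if bundle <= remaining:                       # branch 1: accept bid i
            chosen.append(i)
            search(i + 1, remaining - bundle, value + price, chosen)
            chosen.pop()
        search(i + 1, remaining, value, chosen)       # branch 2: reject bid i
    search(0, frozenset(items), 0, [])
    return best

bids = [(frozenset({"cpu", "ram"}), 9), (frozenset({"cpu"}), 5),
        (frozenset({"ram"}), 5), (frozenset({"disk"}), 3)]
print(winner_determination(bids, {"cpu", "ram", "disk"}))  # (13, [1, 2, 3])
```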
25. Online resource scheduling under concave pricing for cloud computing
ABSTRACT
With the booming cloud computing industry, computational resources are readily and elastically available to customers. In order to attract customers with various demands, most Infrastructure-as-a-Service (IaaS) cloud service providers offer several pricing strategies, such as pay as you go, pay less per unit when you use more (so-called volume discount), and pay even less when you reserve. The diverse pricing schemes among different IaaS service providers, or even within the same provider, form a complex economic landscape that nurtures the market of cloud brokers. By strategically scheduling multiple customers’ resource requests, a cloud broker can fully take advantage of the discounts offered by cloud service providers. In this paper, we focus on how a broker can help a group of customers fully utilize the volume discount pricing strategy offered by cloud service providers through cost-efficient online resource scheduling. We present a randomized online stack-centric scheduling algorithm (ROSA) and theoretically prove the lower bound of its competitive ratio. Three special cases of the offline concave cost scheduling problem and the corresponding optimal algorithms are introduced. Our simulation shows that ROSA achieves a competitive ratio close to the theoretical lower bound under these special cases. Trace-driven simulation using Google cluster data demonstrates that ROSA is superior to conventional online scheduling algorithms in terms of cost saving.
26. A survey of proxy re-encryption for secure data sharing in cloud computing
ABSTRACT
Never before has data sharing been more convenient, thanks to the rapid development and wide adoption of cloud computing. However, how to ensure the cloud user’s data security has become one of the main obstacles that hinder cloud computing from extensive adoption. Proxy re-encryption serves as a promising solution for securing data sharing in cloud computing. It enables a data owner to encrypt shared data in the cloud under its own public key, which is then transformed by a semi-trusted cloud server into an encryption intended for the legitimate recipient, for access control. This paper gives a solid and inspiring survey of proxy re-encryption from different perspectives to offer a better understanding of this primitive. In particular, we review the state of the art of proxy re-encryption by investigating the design philosophy, examining the security models and comparing the efficiency and security proofs of existing schemes. Furthermore, the potential applications and extensions of proxy re-encryption are also discussed. Finally, the paper concludes with a summary of possible future work.
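As a concrete taste of the primitive, here is a toy BBS98-style proxy re-encryption sketch over a small prime-order group (the parameters are illustrative and far too small for real security): the proxy holds only the re-encryption key rk = b/a mod q and converts Alice's ciphertexts into Bob's without ever seeing the plaintext.

```python
# Toy BBS98-style proxy re-encryption. p = 2q + 1 with q prime; g = 4 = 2^2 is a
# quadratic residue, so it generates the order-q subgroup.
p, q, g = 467, 233, 4

def keygen(sk):                      # pk = g^sk
    return pow(g, sk, p)

def enc(pk, M, r):                   # ElGamal variant: (M * g^r, pk^r)
    return (M * pow(g, r, p)) % p, pow(pk, r, p)

def rekey(sk_a, sk_b):               # rk = b / a  (mod q), held by the proxy
    return sk_b * pow(sk_a, q - 2, q) % q

def reenc(rk, ct):                   # (M*g^r, g^{ar}) -> (M*g^r, g^{br})
    c1, c2 = ct
    return c1, pow(c2, rk, p)

def dec(sk, ct):                     # recover M = c1 / (c2^{1/sk})
    c1, c2 = ct
    gr = pow(c2, pow(sk, q - 2, q), p)
    return c1 * pow(gr, p - 2, p) % p

a, b = 21, 77                        # Alice's and Bob's secret keys (toy values)
M = pow(g, 42, p)                    # message encoded as a group element
ct_a = enc(keygen(a), M, r=88)       # encrypted under Alice's key
ct_b = reenc(rekey(a, b), ct_a)      # the semi-trusted proxy transforms it
print(dec(b, ct_b) == M)             # Bob decrypts without Alice's key: True
```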
27. Attribute-based Access Control with Constant-size Ciphertext in Cloud Computing
ABSTRACT
With the popularity of cloud computing, there have been increasing concerns about its security and privacy. Since the cloud computing environment is distributed and untrusted, data owners have to encrypt outsourced data to enforce confidentiality. Therefore, how to achieve practicable access control of encrypted data in an untrusted environment is an urgent issue that needs to be solved. Attribute-Based Encryption (ABE) is a promising scheme suitable for access control in cloud storage systems. This paper proposes a hierarchical attribute-based access control scheme with constant-size ciphertext. The scheme is efficient because the ciphertext length and the number of bilinear pairing evaluations are fixed to a constant. Its computation cost in the encryption and decryption algorithms is low. Moreover, the hierarchical authorization structure of our scheme reduces the burden and risk of a single-authority scenario. We prove that the scheme is CCA2-secure under the decisional q-Bilinear Diffie-Hellman Exponent assumption. In addition, we implement our scheme and analyse its performance. The analysis results show the proposed scheme is efficient, scalable, and fine-grained in dealing with access control for outsourced data in cloud computing.
28. A Context-Aware Architecture Supporting Service Availability in Mobile Cloud Computing
ABSTRACT
Mobile systems are gaining more and more importance, and new promising paradigms like Mobile Cloud Computing are emerging. Mobile Cloud Computing provides an infrastructure where data storage and processing can happen outside the mobile node. Specifically, there is major interest in the use of services obtained by taking advantage, in a transparent way, of the distributed resource pooling provided by nearby mobile nodes. Such systems are useful in application domains such as emergencies, education and tourism. However, these systems are commonly based on dynamic network topologies, in which disconnections and network partitions can occur frequently, and thus the availability of the services is usually compromised. Techniques and methods from Autonomic Computing can be applied to Mobile Cloud Computing to build dependable service models that take into account changes in the context. In this work, a context-aware software architecture is proposed to support the availability of services deployed in mobile and dynamic network environments. The proposal is based on a service replication scheme together with a self-configuration approach for the activation/hibernation of the replicas of the service, depending on relevant context information from the mobile system. To that end, an election algorithm has been designed and implemented.
29. Flexible and Fine-Grained Attribute-Based Data Storage in Cloud Computing
ABSTRACT
With the development of cloud computing, outsourcing data to cloud servers has attracted a lot of attention. To guarantee security and achieve flexible, fine-grained file access control, attribute-based encryption (ABE) was proposed and used in cloud storage systems. However, user revocation is the primary issue in ABE schemes. In this article, we provide a ciphertext-policy attribute-based encryption (CP-ABE) scheme with efficient user revocation for cloud storage systems. The issue of user revocation can be solved efficiently by introducing the concept of a user group. When any user leaves, the group manager will update the private keys of all users except those who have been revoked. Additionally, CP-ABE schemes have heavy computation costs, which grow linearly with the complexity of the access structure. To reduce the computation cost, we outsource the high computation load to cloud service providers without leaking file contents or secret keys. Notably, our scheme can withstand collusion attacks performed by revoked users cooperating with existing users. We prove the security of our scheme under the divisible computation Diffie-Hellman (DCDH) assumption. The results of our experiments show that the computation cost for local devices is relatively low and can be constant. Our scheme is suitable for resource-constrained devices.
30. Fair Resource Allocation for Data-Intensive Computing in the Cloud
ABSTRACT
To address the computing challenge of ‘big data’, a number of data-intensive computing frameworks (e.g., MapReduce, Dryad, Storm and Spark) have emerged and become popular. YARN is a de facto resource management platform that enables these frameworks to run together in a shared system. However, we observe that, in a cloud computing environment, the fair resource allocation policy implemented in YARN is not suitable, because its memoryless resource allocation fashion leads to violations of a number of good properties in shared computing systems. This paper attempts to address these problems for YARN. Both single-level and hierarchical resource allocations are considered. For single-level resource allocation, we propose a novel fair resource allocation mechanism called Long-Term Resource Fairness (LTRF). For hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF) by extending LTRF. We show that both LTRF and H-LTRF can address the fairness problems of the current resource allocation policy and are thus suitable for cloud computing. Finally, we have developed LTYARN by implementing LTRF and H-LTRF in YARN, and our experiments show that it leads to better resource fairness than the existing fair schedulers of YARN.
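The LTRF details are in the paper; as a hedged sketch of the core idea, the toy allocator below grants each unit of capacity to the active user with the smallest cumulative usage, so a user who yields resources now, something a memoryless max-min policy would forget, is compensated in later rounds.

```python
def allocate(demands, capacity, usage):
    """Long-term fairness sketch: each round, grant the next unit of capacity to
    the active user with the smallest cumulative (historical + current) usage."""
    grant = {u: 0 for u in demands}
    for _ in range(capacity):
        active = [u for u in demands if grant[u] < demands[u]]
        if not active:
            break
        u = min(active, key=lambda v: usage[v] + grant[v])
        grant[u] += 1
    for u in grant:
        usage[u] += grant[u]       # remembered for future rounds
    return grant

usage = {"A": 0, "B": 0}
print(allocate({"A": 8, "B": 2}, 8, usage))  # round 1: B needs little -> {'A': 6, 'B': 2}
print(allocate({"A": 8, "B": 8}, 8, usage))  # round 2: B is compensated -> {'A': 2, 'B': 6}
```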
31. Secure Data Sharing in Cloud Computing Using Revocable-Storage Identity-Based Encryption
ABSTRACT
Cloud computing is an Internet-based computing pattern through which shared resources are provided to devices on demand. It is an emerging but promising paradigm for integrating mobile devices into cloud computing, and the integration is performed in the cloud-based hierarchical multi-user data-shared environment. With this integration, security issues such as data confidentiality and user authority may arise in the mobile cloud computing system, and they are regarded as the main constraints on the development of mobile cloud computing. In order to provide safe and secure operation, a hierarchical access control method using modified hierarchical attribute-based encryption (M-HABE) and a modified three-layer structure is proposed in this paper. In a specific mobile cloud computing model, enormous data from all kinds of mobile devices, such as smart phones, feature phones and PDAs, can be controlled and monitored by the system; the data can be sensitive to unauthorized third parties and restricted to legal users as well. The novel scheme mainly focuses on data processing, storing and accessing, and is designed to ensure that users with legal authority obtain the corresponding classified data and to restrict illegal users and unauthorized legal users from accessing the data, which makes it extremely suitable for mobile cloud computing paradigms.
32. Knowledge-Based Resource Allocation for Collaborative Simulation Development in a Multi-tenant Cloud Computing Environment
ABSTRACT
Cloud computing technologies have enabled a new paradigm for advanced product development powered by the provision and subscription of computational services in a multi–tenant distributed simulation environment. The description of computational resources and their optimal allocation among tenants with different requirements holds the key to implementing effective software systems for such a paradigm. To address this issue, a systematic framework for monitoring, analyzing and improving system performance is proposed in this research. Specifically, a radial basis function neural network is established to transform simulation tasks with abstract descriptions into specific resource requirements in terms of their quantities and qualities. Additionally, a novel mathematical model is constructed to represent the complex resource allocation process in a multi–tenant computing environment by considering priority-based tenant satisfaction, total computational cost and multi-level load balance. To achieve optimal resource allocation, an improved multi-objective genetic algorithm is proposed based on the elitist archive and the K-means approaches. As demonstrated in a case study, the proposed framework and methods can effectively support the cloud simulation paradigm and efficiently meet tenants’ computational requirements in a distributed environment.
33. KSF-OABE: Outsourced Attribute-Based Encryption with Keyword Search Function for Cloud Storage
ABSTRACT
Cloud computing is becoming increasingly popular as data owners outsource their data to public cloud servers while allowing intended data users to retrieve the data stored in the cloud. This kind of computing model brings challenges to the security and privacy of data stored in the cloud. Attribute-based encryption (ABE) technology has been used to design fine-grained access control systems, providing one good method for solving the security issues in the cloud setting. However, the computation cost and ciphertext size in most ABE schemes grow with the complexity of the access policy. Outsourced ABE (OABE) with fine-grained access control can largely reduce the computation cost for users who want to access encrypted data stored in the cloud by outsourcing the heavy computation to the cloud service provider (CSP). However, the amount of encrypted files stored in the cloud is becoming very large, which hinders efficient query processing. To deal with this problem, we present a new cryptographic primitive called attribute-based encryption with outsourced key-issuing and outsourced decryption, which can implement a keyword search function (KSF-OABE). The proposed KSF-OABE scheme is proved secure against chosen-plaintext attack (CPA). The CSP performs the partial decryption task delegated by the data user without knowing anything about the plaintext. Moreover, the CSP can perform encrypted keyword search without knowing anything about the keywords embedded in the trapdoor.
34. A Trust Label System for Communicating Trust in Cloud Services
ABSTRACT
Cloud computing is rapidly changing the digital service landscape. A proliferation of Cloud providers has emerged, increasing the difficulty of consumer decisions. Trust issues have been identified as a factor holding back Cloud adoption. The risks and challenges inherent in the adoption of Cloud services are well recognised in the computing literature. In conjunction with these risks, the relative novelty of the online environment as a context for the provision of business services can increase consumer perceptions of uncertainty. This uncertainty is worsened in a Cloud context due to the lack of transparency, from the consumer perspective, into the service types, operational conditions and the quality of service offered by the diverse providers. Previous approaches failed to provide an appropriate medium for communicating trust and trustworthiness in Clouds. A new strategy is required to improve consumer confidence and trust in Cloud providers. This paper presents the operationalisation of a trust label system designed to communicate trust and trustworthiness in Cloud services. We describe the technical details and implementation of the trust label components. Based on a use case scenario, an initial evaluation was carried out to test its operations and its usefulness for increasing consumer trust in Cloud services.
35. Towards Trustworthy Multi-Cloud Services Communities: A Trust-based Hedonic Coalitional Game
ABSTRACT
The prominence of cloud computing has led to an unprecedented proliferation in the number of Web services deployed in cloud data centres. In parallel, service communities have recently gained increasing interest due to their ability to facilitate discovery, composition, and resource scaling in large-scale service markets. The problem is that traditional community formation models may work well when all services reside in a single cloud but cannot support a multi-cloud environment. In particular, these models overlook malicious services that misbehave to illegally maximize their benefits, a risk that arises from grouping together services owned by different providers. Besides, they rely on a centralized architecture whereby a central entity regulates the community formation, which contradicts the distributed nature of cloud-based services. In this paper, we propose a three-fold solution that includes: a trust establishment framework that is resilient to collusion attacks that occur to mislead trust results; a bootstrapping mechanism that capitalizes on the endorsement concept in online social networks to assign initial trust values; and a trust-based hedonic coalitional game that enables services to distributively form trustworthy multi-cloud communities. Experiments conducted on a real-life dataset demonstrate that our model minimizes the number of malicious services compared to three state-of-the-art cloud federation and service community models.
36. Cost Effective, Reliable and Secure Workflow Deployment over Federated Clouds
ABSTRACT
The significant growth in cloud computing has led to an increasing number of cloud providers, each offering their service under different conditions – one might be more secure whilst another might be less expensive or more reliable. At the same time, user applications have become more and more complex. Often, they consist of a diverse collection of software components, and need to handle variable workloads, which poses different requirements on the infrastructure. Therefore, many organisations are considering using a combination of different clouds to satisfy these needs. This raises, however, a non-trivial issue of how to select the best combination of clouds to meet the application requirements. This paper presents a novel algorithm to deploy workflow applications on federated clouds. Firstly, we introduce an entropy-based method to quantify the most reliable workflow deployments. Secondly, we apply an extension of the Bell-LaPadula Multi-Level security model to address application security requirements. Finally, we optimise deployment in terms of its entropy and also its monetary cost, taking into account the cost of computing power, data storage and inter-cloud communication. We implemented our new approach and compared it against two existing scheduling algorithms: the Extended Dynamic Constraint Algorithm (EDCA) and Extended Biobjective dynamic level scheduling (EBDLS). We show that our algorithm can find deployments that are of equivalent reliability but are less expensive and meet security requirements. We have validated our solution through a set of realistic scientific workflows, using well-known cloud simulation tools (WorkflowSim and DynamicCloudSim) and a realistic cloud-based data analysis system (e-Science Central).
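As a small illustration of the security side, here is a Bell-LaPadula-inspired placement filter (a simplified assumption for illustration, not the paper's full extension of the model): a workflow component handling data at a given level may only be deployed on a cloud cleared at or above that level.

```python
LEVELS = {"public": 0, "confidential": 1, "secret": 2}

def can_deploy(component_level: str, cloud_clearance: str) -> bool:
    """Multi-level placement rule: a component may only run on a cloud
    cleared at or above the sensitivity level of the data it handles."""
    return LEVELS[cloud_clearance] >= LEVELS[component_level]

clouds = {"cheapCloud": "public", "govCloud": "secret"}       # hypothetical providers
workflow = {"ingest": "public", "analyse": "secret"}          # component data levels
for task, lvl in workflow.items():
    options = [c for c, cl in clouds.items() if can_deploy(lvl, cl)]
    print(task, "->", options)   # the cost/entropy optimiser then picks among these
```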
37. Circuit Ciphertext-Policy Attribute-Based Hybrid Encryption with Verifiable Delegation in Cloud Computing
ABSTRACT
In the cloud, to achieve access control and keep data confidential, data owners can adopt attribute-based encryption to encrypt the stored data. Users with limited computing power are, however, more likely to delegate the mask of the decryption task to the cloud servers to reduce the computing cost. As a result, attribute-based encryption with delegation emerges. Still, there are caveats and questions remaining in previous relevant works. For instance, during the delegation, the cloud servers could tamper with or replace the delegated ciphertext and return a forged computing result with malicious intent. They may also cheat eligible users by telling them that they are ineligible, for the purpose of cost saving. Furthermore, the access policies used during encryption may not be flexible enough. Since policies for general circuits enable the strongest form of access control, a construction realizing circuit ciphertext-policy attribute-based hybrid encryption with verifiable delegation is considered in our work. In such a system, combined with verifiable computation and an encrypt-then-MAC mechanism, data confidentiality, fine-grained access control and the correctness of the delegated computing results are well guaranteed at the same time. Besides, our scheme achieves security against chosen-plaintext attacks under the k-multilinear Decisional Diffie-Hellman assumption. Moreover, an extensive simulation campaign confirms the feasibility and efficiency of the proposed solution.
38. SecRBAC: Secure data in the Clouds
ABSTRACT
Most current security solutions are based on perimeter security. However, Cloud computing breaks the organization's perimeter. When data resides in the Cloud, it resides outside the organizational bounds. This leads users to a loss of control over their data and raises reasonable security concerns that slow down the adoption of Cloud computing. Is the Cloud service provider accessing the data? Is it legitimately applying the access control policy defined by the user? This paper presents a data-centric access control solution with enriched role-based expressiveness in which security is focused on protecting user data regardless of the Cloud service provider that holds it. Novel identity-based and proxy re-encryption techniques are used to protect the authorization model. Data is encrypted and authorization rules are cryptographically protected to preserve user data against the service provider's access or misbehavior. The authorization model provides high expressiveness with role hierarchy and resource hierarchy support. The solution takes advantage of the logic formalism provided by Semantic Web technologies, which enables advanced rule management like semantic conflict detection. A proof-of-concept implementation has been developed and a working prototypical deployment of the proposal has been integrated within Google services.
39. Joint Energy Minimization and Resource Allocation in C-RAN with Mobile Cloud
ABSTRACT
Cloud radio access network (C-RAN) has emerged as a potential candidate for the next-generation access network technology to address increasing mobile traffic, while mobile cloud computing (MCC) offers a prospective solution for resource-limited mobile users executing computation-intensive tasks. Taking full advantage of the above two cloud-based techniques, C-RAN with MCC is presented in this paper to enhance both performance and energy efficiency. In particular, this paper studies the joint energy minimization and resource allocation in C-RAN with MCC under the time constraints of the given tasks. We first review the energy and time models of the computation and communication. Then, we formulate the joint energy minimization as a non-convex optimization with constraints on task execution time, transmit power, computation capacity and fronthaul data rates. This non-convex optimization is then reformulated into an equivalent convex problem based on the weighted minimum mean square error (WMMSE) approach. An iterative algorithm is finally given to deal with the joint resource allocation in C-RAN with mobile cloud. Simulation results confirm that the proposed energy minimization and resource allocation solution can improve system performance and save energy.
40 .Probabilistic Optimization of Resource Distribution and Encryption for Data Storage in the Cloud
ABSTRACT
In this paper, we develop a decentralized probabilistic method for performance optimization of cloud services. We focus on Infrastructure-as-a-Service, where the user is provided with the ability to configure virtual resources on demand in order to satisfy specific computational requirements. This novel approach is strongly supported by a theoretical framework based on tail probabilities and sample complexity analysis. It allows not only the inclusion of performance metrics for the cloud but also the incorporation of security metrics based on cryptographic algorithms for data storage. To the best of the authors’ knowledge, this is the first unified approach to provisioning performance and security on demand subject to the Service Level Agreement between the client and the cloud service provider. The quality of the service is guaranteed given certain values of accuracy and confidence. We present experimental results using the Amazon Web Services Elastic Compute Cloud (EC2) to validate our probabilistic optimization method.
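For intuition on the accuracy/confidence guarantee, here is a standard sample-complexity bound of the kind such tail-probability frameworks rest on (the generic Hoeffding bound, not necessarily the paper's exact result): to estimate the probability that a virtual resource configuration meets its requirement to within additive error $\epsilon$ with confidence $1-\delta$, it suffices to draw

$$ N \;\ge\; \frac{1}{2\epsilon^{2}}\,\ln\frac{2}{\delta} $$

independent performance samples; for example, $\epsilon = 0.05$ and $\delta = 0.01$ give $N \ge 1060$.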
41 .Collective Energy-Efficiency Approach to Data Center Networks Planning
ABSTRACT
Energy efficiency of data centers (DCs) has become a major concern as DCs continue to grow, often hosting tens or even hundreds of thousands of servers. Clearly, DCs at such a scale imply a data center network (DCN) with a huge number of network nodes and links. The energy consumption of this communication network has skyrocketed and is now in the same league as the cost of the computing servers. With the ever-increasing amount of data that needs to be stored and processed in DCs, DCN traffic continues to soar, drawing increasingly more power. In particular, more than one-third of the total energy in DCs is consumed by communication links, switching and aggregation elements. In this paper, we address the energy efficiency of data centers, explicitly taking into account both servers and the DCN. To this end, we present VPTCA, a collective energy-efficiency approach to data center network planning that deals with virtual machine (VM) placement and communication traffic configuration. VPTCA aims particularly to reduce the energy consumption of the DCN by assigning interrelated VMs to the same server or pod, which effectively reduces the amount of transmission load. At the traffic level, VPTCA optimally uses switch ports and link bandwidth to balance load and avoid congestion, enabling the DCN to increase its transmission capacity and save a significant amount of network energy. In our evaluation via NS-2 simulations, the performance of VPTCA is measured and compared with two well-known DCN management algorithms, Global First Fit and ElasticTree. Based on our experimental results, VPTCA outperforms existing algorithms in providing the DCN more transmission capacity with less energy consumption.
42 .Middleware-oriented Deployment Automation for Cloud Applications
ABSTRACT
Fully automated provisioning and deployment of applications is one of the most essential prerequisites for realizing the benefits of Cloud computing and reducing the costs of managing applications. A huge variety of approaches, tools, and providers are available to automate the involved processes. The DevOps community, for instance, provides tooling and reusable artifacts to implement deployment automation in an application-oriented manner. Platform-as-a-Service frameworks are available for the same purpose. In this work we systematically classify and characterize available deployment approaches independently of the underlying technology used. For motivation and evaluation purposes, we choose Web applications with different technology stacks and analyze their specific deployment requirements. Afterwards, we provision these applications using each of the identified types of deployment approaches in the Cloud to perform qualitative and quantitative measurements. Finally, we discuss the evaluation results and derive recommendations for deciding which deployment approach to use based on the deployment requirements of an application. Our results show that deployment approaches can also be efficiently combined if there is no ‘best fit’ for a particular application.
43 .Trust-but-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining in Data-Mining-As-a-Service Paradigm
ABSTRACT
Cloud computing is popularizing the computing paradigm in which data is outsourced to a third-party service provider (server) for data mining. Outsourcing, however, raises a serious security issue: how can a client of weak computational power verify that the server returned correct mining results? In this paper, we focus on the specific task of frequent itemset mining. We consider a server that is potentially untrusted and tries to escape verification by using its prior knowledge of the outsourced data. We propose efficient probabilistic and deterministic verification approaches to check whether the server has returned correct and complete frequent itemsets. Our probabilistic approach can catch incorrect results with high probability, while our deterministic approach measures result correctness with 100 percent certainty. We also design efficient verification methods for the cases in which the data and the mining setup are updated. We demonstrate the effectiveness and efficiency of our methods using an extensive set of empirical results on real datasets.
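One simplified flavor of such a probabilistic check is to plant artificial "evidence" itemsets that the data owner forces to be frequent; a server that does not know which itemsets were planted and omits or corrupts part of the result is then caught with high probability. A minimal sketch, with all parameters assumed for illustration:

```python
import random

def plant_evidence(transactions, evidence_itemsets, support_count):
    """Append artificial transactions so each evidence itemset is guaranteed frequent."""
    planted = [set(t) for t in transactions]
    for itemset in evidence_itemsets:
        planted.extend(set(itemset) for _ in range(support_count))
    return planted

def verify(returned_frequent, evidence_itemsets):
    """Every planted evidence itemset must appear in the server's result."""
    returned = {frozenset(i) for i in returned_frequent}
    return all(frozenset(e) in returned for e in evidence_itemsets)

evidence = [random.sample("abcdefgh", 2) for _ in range(3)]  # secret, known only to the client
```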
44 .Energy-efficient Adaptive Resource Management for Real-time Vehicular Cloud Services
ABSTRACT
Providing real-time cloud services to Vehicular Clients (VCs) must cope with delay and delay-jitter issues. Fog computing is an emerging paradigm that aims at distributing small-size self-powered data centers (e.g., Fog nodes) between remote Clouds and VCs, in order to deliver data-dissemination real-time services to the connected VCs. Motivated by these considerations, in this paper we propose and test an energy-efficient adaptive resource scheduler for Networked Fog Centers (NetFCs). They operate at the edge of the vehicular network and are connected to the served VCs through Infrastructure-to-Vehicular (I2V) TCP/IP-based single-hop mobile links. The goal is to exploit the locally measured states of the TCP/IP connections in order to maximize the overall communication-plus-computing energy efficiency, while meeting the application-induced hard QoS requirements on minimum transmission rates, maximum delays and delay-jitters. The resulting energy-efficient scheduler jointly performs: (i) admission control of the input traffic to be processed by the NetFCs; (ii) minimum-energy dispatching of the admitted traffic; (iii) adaptive reconfiguration and consolidation of the Virtual Machines (VMs) hosted by the NetFCs; and (iv) adaptive control of the traffic injected into the TCP/IP mobile connections. The salient features of the proposed scheduler are that: (i) it is adaptive and admits a distributed and scalable implementation; and (ii) it is capable of providing hard QoS guarantees, in terms of minimum/maximum instantaneous rates of the traffic delivered to the vehicular clients, instantaneous rate-jitters and total processing delays. The actual performance of the proposed scheduler in the presence of: (i) client mobility; (ii) wireless fading; and (iii) reconfiguration and consolidation costs of the underlying NetFCs, is numerically tested and compared against that of some state-of-the-art schedulers, under both synthetically generated and measured real-world workload traces.
45 .Cloud Service Reliability Enhancement via Virtual Machine Placement Optimization
ABSTRACT
With rapid adoption of the cloud computing model, many enterprises have begun deploying cloud-based services. Failures of virtual machines (VMs) in clouds have caused serious quality assurance issues for those services. VM replication is a commonly used technique for enhancing the reliability of cloud services. However, when determining the VM redundancy strategy for a specific service, many state-of-the-art methods ignore the huge network resource consumption issue that could be experienced when the service is in failure recovery mode. This paper proposes a redundant VM placement optimization approach to enhancing the reliability of cloud services. The approach employs three algorithms. The first algorithm selects an appropriate set of VM-hosting servers from a potentially large set of candidate host servers based upon the network topology. The second algorithm determines an optimal strategy to place the primary and backup VMs on the selected host servers with k-fault-tolerance assurance. Lastly, a heuristic is used to address the task-to-VM reassignment optimization problem, which is formulated as finding a maximum weight matching in bipartite graphs. The evaluation results show that the proposed approach outperforms four other representative methods in network resource consumption in the service recovery stage.
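The last step above, task-to-VM reassignment as maximum weight matching in a bipartite graph, can be sketched directly with the Hungarian algorithm; the weight matrix below is an assumed stand-in for the paper's utility of running task i on backup VM j:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

weight = np.array([[0.9, 0.4, 0.1],    # rows: tasks, columns: candidate backup VMs
                   [0.3, 0.8, 0.5],
                   [0.2, 0.6, 0.7]])
tasks, vms = linear_sum_assignment(weight, maximize=True)  # maximum weight matching
print(list(zip(tasks.tolist(), vms.tolist())), weight[tasks, vms].sum())  # [(0,0),(1,1),(2,2)] 2.4
```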
46 .A Novel Statistical Cost Model and an Algorithm for Efficient Application Offloading to Clouds
ABSTRACT
This work presents a novel statistical cost model for applications that can be offloaded to cloud computing environments. The model constructs a tree structure, referred to as the execution dependency tree (EDT), to accurately represent various execution relations, or dependencies (e.g., sequential, parallel and conditional branching), among the application modules along its different execution paths. Contrary to existing models that assume fixed average offloading costs, each module’s cost is modelled as a random variable described by its Cumulative Distribution Function (CDF), estimated statistically through application profiling. Using this model, we generalize the offloading cost optimization functions to ones that use more user-tailored statistical measures such as cost percentiles. We employ these functions to propose an efficient offloading algorithm based on a dynamic programming formulation. We also show that the proposed model can be used as an efficient tool for application analysis by developers to gain insights into an application’s statistical performance under varying network conditions and user behaviours. Performance evaluation results show that the achieved mean absolute percentage error between the model-based estimated cost and the measured one for the application execution time can be as small as 5% for applications with sequential and branching module dependencies.
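The practical difference between a fixed-average cost and a percentile of the profiled CDF can be seen in a few lines; the timing samples below are invented for illustration:

```python
import numpy as np

local = [120, 135, 150, 180, 90]   # profiled local execution times of one module (ms)
offload = [60, 70, 300, 65, 75]    # offloaded times; the 300 ms sample is a network tail

# A mean-based rule offloads (mean 114 ms vs 135 ms); a 90th-percentile rule
# keeps the module local (168 ms vs 210 ms) because it accounts for the tail.
for q in (50, 90):
    print(q, np.percentile(local, q), np.percentile(offload, q))
```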
47 .PacketCloud: A Cloudlet-Based Open Platform for In-Network Services
ABSTRACT
The Internet was designed with the end-to-end principle, where the network layer provides merely a best-effort forwarding service. This design makes it challenging to add new services into the Internet infrastructure. However, as Internet connectivity becomes a commodity, users and applications increasingly demand new in-network services. This paper proposes PacketCloud, a cloudlet-based open platform to host in-network services. Different from standalone, specialized middleboxes, cloudlets can efficiently share a set of commodity servers among different services, and serve the network traffic in an elastic way. PacketCloud can help both Internet Service Providers (ISPs) and emerging application/content providers deploy their services at strategic network locations. We have implemented a proof-of-concept prototype of PacketCloud. PacketCloud introduces a small additional delay, and can scale well to handle high-throughput data traffic. We have evaluated PacketCloud in both a fully functional emulated environment and the real Internet.
48 .A Dynamical and Load-Balanced Flow Scheduling Approach for Big Data Centers in Clouds
ABSTRACT
Load-balanced flow scheduling for big data centers in clouds, in which a large amount of data needs to be transferred frequently among thousands of interconnected servers, is a key and challenging issue. OpenFlow is a promising solution for balancing data flows in a data center network through its programmable traffic controller. Existing OpenFlow-based scheduling schemes, however, statically set up routes only at the initialization stage of data transmissions, which cannot adapt to dynamic flow distributions and changing network states in data centers and often results in poor system performance. In this paper, we propose a novel dynamical load-balanced scheduling (DLBS) approach for maximizing network throughput while balancing workload dynamically. We first formulate the DLBS problem, and then develop a set of efficient heuristic scheduling algorithms for the two typical OpenFlow network models, which balance data flows time slot by time slot. Experimental results demonstrate that our DLBS approach significantly outperforms other representative load-balanced scheduling algorithms, Round Robin and LOBUS; and the higher the imbalance degree the data flows in a data center exhibit, the more improvement our DLBS approach brings.
49 .Feedback Autonomic Provisioning for Guaranteeing Performance in MapReduce Systems
ABSTRACT
Companies have fast-growing amounts of data to process and store; a data explosion is happening all around us. Currently, one of the most common approaches to treating these vast data quantities is based on the MapReduce parallel programming paradigm. While its use is widespread in industry, ensuring performance constraints while at the same time minimizing costs still presents considerable challenges. We propose a coarse-grained control-theoretical approach based on techniques that have already proved their usefulness in the control community. We introduce the first algorithm to create dynamic models for Big Data MapReduce systems running a concurrent workload. Furthermore, we identify two important control use cases: relaxed performance with minimal resources, and strict performance. For the first case we develop two feedback control mechanisms: a classical feedback controller and an event-based feedback controller that also minimises the number of cluster reconfigurations. Moreover, to address strict performance requirements, a feedforward predictive controller that efficiently suppresses the effects of large workload size variations is developed. All the controllers are validated online in a benchmark running in a real 60-node MapReduce cluster, using a data-intensive Business Intelligence workload. Our experiments demonstrate the success of the control strategies employed in assuring service-time constraints.
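As a flavor of the relaxed-performance feedback loop described, here is a minimal proportional-integral (PI) sketch that sizes a cluster to keep measured service time near a target; the gains, target and cluster interface are assumptions, not the paper's tuned controller:

```python
class PIController:
    def __init__(self, kp: float, ki: float, target: float):
        self.kp, self.ki, self.target = kp, ki, target
        self.integral = 0.0

    def next_cluster_size(self, measured_service_time: float, current_nodes: int) -> int:
        error = measured_service_time - self.target   # positive => too slow, add nodes
        self.integral += error
        delta = self.kp * error + self.ki * self.integral
        return max(1, round(current_nodes + delta))

ctrl = PIController(kp=0.05, ki=0.01, target=30.0)     # 30 s service-time target
print(ctrl.next_cluster_size(measured_service_time=45.0, current_nodes=40))  # -> 41
```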
50 .Effective Modelling Approach for IaaS Data Center Performance Analysis under Heterogeneous Workload
ABSTRACT
Heterogeneity prevails not only among physical machines but also among workloads in real IaaS Cloud data centers (CDCs). This heterogeneity makes performance modelling of large and complex IaaS CDCs even more challenging. This paper considers the scenario where the number of virtual CPUs requested by each customer job may differ. We propose a hierarchical stochastic modelling approach applicable to IaaS CDC performance analysis under such a heterogeneous workload. Numerical results obtained from the proposed analytic model are verified through discrete-event simulations under various system parameter settings.
51 .An Energy-Efficient VM Prediction and Migration Framework for Overcommitted Clouds
ABSTRACT
We propose an integrated, energy-efficient resource allocation framework for overcommitted clouds. The framework achieves substantial energy savings by 1) minimizing Physical Machine (PM) overload occurrences via VM resource usage monitoring and prediction, and 2) reducing the number of active PMs via efficient VM migration and placement. Using real Google data consisting of 29-day traces collected from a cluster containing more than 12K PMs, we show that our proposed framework outperforms existing overload-avoidance techniques and prior VM migration strategies by reducing the number of unpredicted overloads, minimizing migration overhead, increasing resource utilization, and reducing cloud energy consumption.
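The prediction half of such a framework can be sketched with a simple exponential-smoothing forecast per VM, flagging a PM whose forecast aggregate demand exceeds capacity; the smoothing factor and capacities are assumed for illustration:

```python
def smoothed_forecast(history, alpha=0.5):
    """Exponentially smoothed estimate of a VM's next-interval resource demand."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def predicted_overload(pm_capacity, vm_histories):
    """Flag the PM if the sum of its VMs' forecasts exceeds its capacity."""
    return sum(smoothed_forecast(h) for h in vm_histories) > pm_capacity

print(predicted_overload(16.0, [[2, 3, 4], [5, 6, 7], [4, 4, 5]]))  # -> False (14.0 <= 16.0)
```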
52 .Identity-Based Encryption with Cloud Revocation Authority and Its Applications
ABSTRACT
Identity-based encryption (IBE) is a public key cryptosystem that eliminates the demands of public key infrastructure (PKI) and certificate administration in conventional public key settings. Due to the absence of PKI, the revocation problem is a critical issue in IBE settings. Several revocable IBE schemes have been proposed regarding this issue. Quite recently, by embedding an outsourcing computation technique into IBE, Li et al. proposed a revocable IBE scheme with a key-update cloud service provider (KU-CSP). However, their scheme has two shortcomings. One is that the computation and communication costs are higher than in previous revocable IBE schemes. The other is a lack of scalability, in the sense that the KU-CSP must keep a secret value for each user. In this article, we propose a new revocable IBE scheme with a cloud revocation authority (CRA) that solves both shortcomings: the performance is significantly improved, and the CRA holds only a single system secret for all users. For security analysis, we demonstrate that the proposed scheme is semantically secure under the decisional bilinear Diffie-Hellman (DBDH) assumption. Finally, we extend the proposed revocable IBE scheme to present a CRA-aided authentication scheme with period-limited privileges for managing a large number of various cloud services.
53 .A Cloud Gaming System Based on User-Level Virtualization and Its Resource Scheduling
ABSTRACT
Many believe the future of gaming lies in the cloud, namely Cloud Gaming, which renders an interactive gaming application in the cloud and streams the scenes as a video sequence to the player over the Internet. This paper proposes GCloud, a GPU/CPU hybrid cluster for cloud gaming based on user-level virtualization technology. Specifically, we present a performance model to analyze server capacity and games’ resource consumption, which categorizes games into two types: CPU-critical and memory-I/O-critical. Consequently, several scheduling strategies are proposed to improve resource utilization and are compared with others. Simulation tests show that both the First-Fit-like and the Best-Fit-like strategies outperform the others; in particular, they are near-optimal in the batch processing mode. Other test results indicate that GCloud is efficient: an off-the-shelf PC can support five high-end video games running at the same time. In addition, the average per-frame processing delay is 8~19 ms under different image resolutions, which outperforms other similar solutions.
54 .Optimal Joint Scheduling and Cloud Offloading for Mobile Applications
ABSTRACT
Cloud offloading is an indispensable solution for supporting computationally demanding applications on resource-constrained mobile devices. In this paper, we introduce the concept of wireless-aware joint scheduling and computation offloading (JSCO) for multi-component applications, where an optimal decision is made on which components need to be offloaded as well as on the scheduling order of these components. The JSCO approach allows for more degrees of freedom in the solution by moving away from a compiler-predetermined scheduling order for the components towards a more wireless-aware scheduling order. For some component dependency graph structures, the proposed algorithm can shorten execution times by processing appropriate components in parallel on the mobile device and in the cloud. We define a net utility that trades off the energy saved by the mobile device against constraints on communication delay, overall application execution time, and component precedence ordering. The linear optimization problem is solved using real data measurements obtained from running multi-component applications on an HTC smartphone and Amazon EC2, using Wi-Fi for cloud offloading. The performance is further analyzed using various component dependency graph topologies and sizes. Results show that the energy saved increases with longer application runtime deadlines, higher wireless rates, and smaller offload data sizes.
55 .An Efficient Privacy-Preserving Ranked Keyword Search Method
ABSTRACT
Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation, so it is essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationships between documents are normally concealed in the process of encryption, which leads to significant degradation of search accuracy. Also, the volume of data in data centers has experienced dramatic growth, making it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on a minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum cluster size is met. In the search phase, this approach achieves linear computational complexity even as the document collection grows exponentially in size. In order to verify the authenticity of search results, a structure called the minimum hash sub-tree is also designed. Experiments have been conducted using a collection set built from IEEE Xplore. The results show that with a sharp increase of documents in the dataset, the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.
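The cluster-splitting idea can be sketched as a recursive bisection that enforces the maximum-cluster-size constraint; splitting on the median of an assumed relevance score is a stand-in for the paper's minimum-relevance criterion. Because a query descends only one branch per level, search cost grows roughly linearly rather than with the full collection size:

```python
def split_until_small(docs, max_size):
    """docs: list of (doc_id, relevance_score); returns leaf clusters."""
    if len(docs) <= max_size:
        return [docs]
    docs = sorted(docs, key=lambda d: d[1])
    mid = len(docs) // 2        # bisect at the median score
    return split_until_small(docs[:mid], max_size) + split_until_small(docs[mid:], max_size)

clusters = split_until_small([("d%d" % i, i / 10.0) for i in range(10)], max_size=3)
print([[doc_id for doc_id, _ in c] for c in clusters])   # cluster sizes 2, 3, 2, 3
```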
56 .A Taxonomy of Job Scheduling on Distributed Computing Systems
ABSTRACT
Hundreds of papers on job scheduling for distributed systems are published every year, and it becomes increasingly difficult to classify them. Our analysis revealed that half of these papers are barely cited. This paper presents a general taxonomy for scheduling problems and solutions in distributed systems. The taxonomy was used to classify 109 scheduling problems and their solutions and to make this classification publicly available. These 109 problems were further clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will help researchers build on prior art, increase the visibility of new research, and minimize redundant effort.
57 .LazyCtrl: A Scalable Hybrid Network Control Plane Design for Cloud Data Centers
ABSTRACT
The advent of software-defined networking enables flexible, reliable and feature-rich control planes for data center networks. However, the tight coupling of centralized control and complete visibility leads to a wide range of issues, among which scalability has risen to prominence due to the excessive workload on the central controller. By analyzing the traffic patterns from a couple of production data centers, we observe that data center traffic is usually highly skewed and thus edge switches can be clustered into a set of communication-intensive groups according to traffic locality. Motivated by this observation, we present LazyCtrl, a novel hybrid control plane design for data center networks where network control is carried out by distributed control mechanisms inside independent groups of switches while complemented with a global controller. LazyCtrl aims at bringing laziness to the global controller by dynamically devolving most of the control tasks to independent switch groups to process frequent intra-group events near the data path, while handling rare inter-group or other specified events at the controller. We implement LazyCtrl and build a prototype based on Open vSwitch and Floodlight. Trace-driven experiments on our prototype show that an effective switch grouping is easy to maintain in multi-tenant clouds and that the central controller can be significantly shielded by staying “lazy”, with its workload reduced by up to 82%.
58 .Ensemble: A Tool for Performance Modelling of Applications in Cloud Data Centers
ABSTRACT
We introduce Ensemble, a runtime framework and associated tools for building application performance models on-the-fly. These dynamic performance models can be used to support complex, highly dimensional resource allocation, and/or what-if performance inquiry in modern heterogeneous environments, such as data centers and Clouds. Ensemble combines simple, partially specified, and lower-dimensionality models to provide good initial approximations for higher dimensionality application performance models. We evaluated Ensemble on industry-standard and scientific applications. The results show that Ensemble provides accurate, fast, and flexible performance models even in the presence of significant environment variability.
59 .AutoElastic: Automatic Resource Elasticity for High Performance Applications in the Cloud
ABSTRACT
Elasticity is undoubtedly one of the most striking characteristics of cloud computing. Especially in the area of high performance computing (HPC), elasticity can be used to execute irregular and CPU-intensive applications. However, on-the-fly increase/decrease in resources is more widespread in Web systems, which have their own IaaS-level load balancer. In the HPC area, current approaches usually focus on batch jobs or rely on assumptions such as previous knowledge of application phases, source code rewriting or a stop-reconfigure-and-go approach to elasticity. In this context, this article presents AutoElastic, a PaaS-level elasticity model for HPC in the cloud. Its differential approach consists of providing elasticity for high performance applications without user intervention or source code modification. The scientific contributions of AutoElastic are twofold: (i) an aging-based approach to resource allocation and deallocation actions that avoids unnecessary virtual machine (VM) reconfigurations (thrashing); and (ii) asynchronism in creating and terminating VMs, so that the application does not need to wait for these procedures to complete. The prototype evaluation using OpenNebula middleware showed performance gains of up to 26 percent in the execution time of an application with the AutoElastic manager. Moreover, we observed low intrusiveness of AutoElastic when reconfigurations do not occur.
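The aging idea, scaling only after the threshold has been violated persistently so that short load spikes do not trigger VM thrashing, fits in a few lines; the window and threshold values are assumed:

```python
def should_scale_out(load_history, threshold=0.8, window=3):
    """Aging-based rule: allocate a VM only after `window` consecutive
    observations above `threshold`, filtering out transient spikes."""
    recent = load_history[-window:]
    return len(recent) == window and all(x > threshold for x in recent)

print(should_scale_out([0.9, 0.7, 0.9, 0.85, 0.95]))  # -> True: sustained high load
print(should_scale_out([0.7, 0.7, 0.95]))             # -> False: a lone spike is ignored
```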
60 .Supporting Multi Data Stores Applications in Cloud Environments
ABSTRACT
The production of huge amounts of data and the emergence of cloud computing have introduced new requirements for data management. Many applications need to interact with several heterogeneous data stores depending on the type of data they have to manage: traditional data types, documents, graph data from social networks, simple key-value data, etc. Interacting with heterogeneous data models via different APIs imposes challenging tasks on the developers of multiple-data-store applications. Indeed, programmers have to be familiar with different APIs. In addition, the execution of complex queries over heterogeneous data models cannot, currently, be achieved in a declarative way, as it can be with mono-data-store applications, and therefore requires extra implementation effort. Moreover, developers need to master and deal with the complex processes of cloud discovery and application deployment and execution. In this paper we propose an integrated set of models, algorithms and tools aimed at alleviating the developer's task of developing, deploying and migrating multiple-data-store applications in cloud environments. Our approach focuses on three main points. First, we provide a unifying data model used by application developers to interact with heterogeneous relational and NoSQL data stores. Based on that, they express queries using the OPEN-PaaS-Database API (ODBAPI), a single REST API allowing programmers to write their application code independently of the target data stores. Second, we propose virtual data stores, which act as mediators and interact with the integrated data stores wrapped by ODBAPI. This run-time component supports the execution of single and complex queries over heterogeneous data stores. Finally, we present a declarative approach that lightens the burden of the tedious and non-standard tasks of (1) discovering relevant cloud environments and (2) deploying applications on them, while letting developers simply focus on specifying their storage and computing requirements. A prototype of the proposed solution has been developed and is currently used to implement use cases from the OpenPaaS project.
61 .Coral: A Cloud-Backed Frugal File System
ABSTRACT
With simple access interfaces and flexible billing models, cloud storage has become an attractive solution to simplify storage management for both enterprises and individual users. However, traditional file systems with extensive optimizations for local disk-based storage backends cannot fully exploit the inherent features of the cloud to obtain desirable performance. In this paper, we present the design, implementation, and evaluation of Coral, a cloud-backed file system that strikes a balance between performance and monetary cost. Unlike previous studies that treat cloud storage as just a normal backend of existing networked file systems, Coral is designed to address several key issues in optimizing cloud-based file systems, such as data layout, block management, and the billing model. With carefully designed data structures and algorithms, such as identifying semantically correlated data blocks, a kd-tree-based caching policy with self-adaptive thrashing prevention, effective data layout, and optimal garbage collection, Coral achieves good performance and cost savings under various workloads, as demonstrated by extensive evaluations.
62 .EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud
ABSTRACT
The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increased latency may cause a massive loss to the enterprise. Similarity detection plays a very important role in data management. Many typical algorithms such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA) are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles, much memory space, and tremendous disk accesses. In addition, the overhead increases with the growth of the data set volume and results in long delays. Instead of reading the entire file, TSA samples some data blocks to calculate fingerprints as similarity characteristic values. The overhead of TSA is fixed and negligible. However, a slight modification of the source file shifts the positions of the file content, so a failure of similarity identification is inevitable under slight modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud by taking the file length modulo a fixed value. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shifts incurred by modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm to reduce the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms existing well-known algorithms in terms of time overhead, CPU and memory occupation. Moreover, EPAS makes a more preferable trade-off between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
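A minimal sketch of the position-aware sampling step (block size, sample counts and the modulation unit are assumed; the paper's improved metric and query structures are omitted): fingerprints are taken from both the head and the tail of the length-modulated file so that a small edit does not shift every sampled offset:

```python
import hashlib

def epas_fingerprints(data: bytes, block=4096, samples=4, unit=1 << 20):
    usable = len(data) - (len(data) % unit) or len(data)   # modulate the file length
    offsets = [i * block for i in range(samples)]                   # head blocks
    offsets += [usable - (i + 1) * block for i in range(samples)]   # tail blocks
    return {hashlib.sha1(data[o:o + block]).hexdigest()
            for o in offsets if 0 <= o < len(data)}

def similarity(a: bytes, b: bytes) -> float:
    fa, fb = epas_fingerprints(a), epas_fingerprints(b)
    return len(fa & fb) / max(1, len(fa | fb))   # Jaccard over sampled fingerprints
```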
63 .TMACS: A Robust and Verifiable Threshold Multi-Authority Access Control System in Public Cloud Storage
ABSTRACT
Attribute-based Encryption (ABE) is regarded as a promising cryptographic tool to guarantee data owners’ direct control over their data in public cloud storage. Earlier ABE schemes involve only one authority to maintain the whole attribute set, which can create a single-point bottleneck for both security and performance. Subsequently, some multi-authority schemes were proposed, in which multiple authorities separately maintain disjoint attribute subsets. However, the single-point bottleneck problem remains unsolved. In this paper, from another perspective, we propose a threshold multi-authority CP-ABE access control scheme for public cloud storage, named TMACS, in which multiple authorities jointly manage a uniform attribute set. In TMACS, taking advantage of (t, n) threshold secret sharing, the master key can be shared among multiple authorities, and a legal user can generate his/her secret key by interacting with any t authorities. Security and performance analysis results show that TMACS is not only verifiably secure when fewer than t authorities are compromised, but also robust when no fewer than t authorities are alive in the system. Furthermore, by efficiently combining the traditional multi-authority scheme with TMACS, we construct a hybrid scheme that satisfies the scenario of attributes coming from different authorities as well as achieving security and system-level robustness.
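The (t, n) threshold primitive TMACS builds on is classic Shamir secret sharing: any t shares of the master key reconstruct it, while fewer than t reveal nothing. A minimal self-contained sketch (the field prime and parameters are illustrative, not the scheme's actual setup):

```python
import random

P = 2**127 - 1   # prime field modulus

def make_shares(secret: int, t: int, n: int):
    """Degree-(t-1) random polynomial with the secret as constant term."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any t shares."""
    total = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P   # den^(P-2) = den^-1 mod P
    return total

shares = make_shares(123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
```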
64 .Risk Assessment in a Sensor Cloud Framework Using Attack Graphs
ABSTRACT
A sensor cloud consists of various heterogeneous wireless sensor networks (WSNs). These WSNs may have different owners and run a wide variety of user applications on demand over a wireless communication medium. Hence, they are susceptible to various security attacks, and a need exists to formulate effective and efficient security measures that safeguard applications in the sensor cloud from such attacks. However, analyzing the impact of different attacks and their cause-consequence relationships is a prerequisite before security measures can be either developed or deployed. In this paper, we propose a risk assessment framework for WSNs in a sensor cloud that utilizes attack graphs. We use Bayesian networks not only to assess but also to analyze attacks on WSNs. The risk assessment framework first reviews the impact of attacks on a WSN and estimates reasonable time frames that predict the degradation of WSN security parameters such as confidentiality, integrity and availability. Our risk assessment framework allows a security administrator to better understand the threats present and take the necessary actions against them. The framework is validated by comparing the assessment results with results obtained from different simulated attack scenarios.
65 .RepCloud: Attesting to Cloud Service Dependency
ABSTRACT
Security enhancements to the emerging IaaS (Infrastructure as a Service) cloud computing systems have become the focus of much research, but little of this targets the underlying infrastructure. Trusted Cloud systems have been proposed to integrate Trusted Computing infrastructure with cloud systems. With remote attestations, cloud customers are able to determine the genuine behaviours of their applications’ hosts, and thereby establish trust in the cloud. However, current Trusted Clouds have difficulty effectively attesting to the cloud service dependencies of customers’ applications, due to the cloud’s complexity, heterogeneity and dynamism. In this paper, we present RepCloud, a decentralized cloud trust management framework inspired by the reputation systems studied in peer-to-peer research. With RepCloud, cloud customers are able to determine the properties of the exact nodes that may affect the genuine functionality of their applications, without obtaining much internal information about the cloud. Experiments showed that, besides achieving fine-grained cloud service dependency attestation, RepCloud incurs lower trust management overhead than existing trusted cloud systems.
66 .Poris: A Scheduler for Parallel Soft Real-Time Applications in Virtualized Environments
ABSTRACT
With the prevalence of cloud computing and virtualization, more and more cloud services, including parallel soft real-time applications (PSRT applications), are running in virtualized data centers. However, current hypervisors do not provide adequate support for them because of soft real-time constraints and synchronization problems, which result in frequent deadline misses and serious performance degradation. CPU schedulers in underlying hypervisors are central to these issues. In this paper, we identify and analyze CPU scheduling problems in hypervisors. Then, we design and implement a parallel soft real-time scheduler based on Xen, named Poris, according to this analysis. It addresses both soft real-time constraints and synchronization problems simultaneously. In our proposed method, priority promotion and dynamic time slice mechanisms are introduced to determine when to schedule virtual CPUs (VCPUs) according to the characteristics of soft real-time applications. Besides, considering that PSRT applications may run in a single virtual machine (VM) or multiple VMs, we present parallel scheduling, group scheduling and communication-driven group scheduling to accelerate synchronization of these applications and ensure that tasks are finished before their deadlines under different scenarios. Our evaluation shows that Poris can significantly improve the performance of PSRT applications whether they run in a single VM or multiple VMs. For example, compared to the Credit scheduler, Poris decreases the response time of the web search benchmark by up to 91.6 percent.
67 .Cost Minimization Algorithms for Data Center Management
ABSTRACT
Due to the increasing usage of cloud computing applications, it is important to minimize the energy cost consumed by a data center and, simultaneously, to improve the quality of service via data center management. One promising approach is to switch some servers in a data center to the idle mode to save energy, while keeping a suitable number of servers in the active mode to provide timely service. In this paper, we design both online and offline algorithms for this problem. For the offline algorithm, we formulate data center management as a cost minimization problem by considering energy cost, delay cost (to measure service quality), and switching cost (to change servers' active/idle mode). Then, we analyze certain properties of an optimal solution which lead to a dynamic programming based algorithm. Moreover, by revising the solution procedure, we successfully eliminate the recursion and achieve an optimal offline algorithm with polynomial complexity. For the online algorithm, we design it by considering the worst-case scenario for future workload. In simulation, we show that this online algorithm can always provide near-optimal solutions.
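A compact dynamic program over time slots captures the offline trade-off described (energy + delay + switching); the cost coefficients and the delay proxy below are assumed stand-ins for the paper's model:

```python
def min_total_cost(workload, max_servers, e=1.0, d=5.0, s=2.0):
    """cost[m] = cheapest way to reach the current slot with m active servers."""
    INF = float("inf")
    cost = [0.0] * (max_servers + 1)
    for load in workload:
        new = [INF] * (max_servers + 1)
        for m in range(1, max_servers + 1):
            run = e * m + d * load / m               # energy plus a delay proxy
            for prev in range(1, max_servers + 1):
                switch = s * max(0, m - prev)        # pay only to power servers on
                new[m] = min(new[m], cost[prev] + run + switch)
        cost = new
    return min(cost[1:])

print(min_total_cost([4, 8, 2], max_servers=4))
```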
68 .K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical and Experimental Analysis
ABSTRACT
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications such as knowledge discovery and data mining. However, as the volume and dimensionality of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the existing approaches for computing kNN on MapReduce, first theoretically and then through an extensive experimental evaluation. To be able to compare the solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze each step in terms of load balancing, accuracy and complexity. Experiments in this paper use a variety of datasets and analyze the impact of data volume, data dimensionality and the value of k from many perspectives, such as time and space complexity and accuracy. The experiments surface further advantages and shortcomings, which are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
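The generic steps can be mirrored in a toy, pure-Python map/reduce decomposition of kNN (no Hadoop; partitioning and data are invented): each mapper returns its partition's local top-k, and the reducer merges candidates into the global top-k:

```python
import heapq

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mapper(partition, query, k):
    # local computation: top-k inside one data partition
    return heapq.nsmallest(k, ((dist(p, query), p) for p in partition))

def reducer(local_topk_lists, k):
    # merge step: global top-k over all partitions' candidates
    return heapq.nsmallest(k, (c for lst in local_topk_lists for c in lst))

parts = [[(0, 0), (5, 5)], [(1, 1), (9, 9)], [(2, 2)]]
print(reducer([mapper(p, query=(1, 0), k=2) for p in parts], k=2))
```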
69 .Efficient Algorithms for Mining Top-K High Utility Itemsets
ABSTRACT
High utility itemset (HUI) mining is an emerging topic in data mining, which refers to discovering all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However, setting min_util appropriately is a difficult problem for users. Generally speaking, finding an appropriate minimum utility threshold by trial and error is a tedious process. If min_util is set too low, too many HUIs will be generated, which may make the mining process very inefficient. On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we address these issues by proposing a new framework for top-k high utility itemset mining, where k is the desired number of HUIs to be mined. Two types of efficient algorithms, named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase), are proposed for mining such itemsets without the need to set min_util. We provide a structural comparison of the two algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal case of state-of-the-art utility mining algorithms.
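The idea common to both algorithms, replacing the user-supplied min_util with an internal border that rises as better itemsets are found, can be sketched as follows (candidate enumeration and utility computation are abstracted away):

```python
import heapq

def top_k(candidates, k):
    """candidates: iterable of (itemset, utility); keeps the k best and raises
    an internal min-utility border that prunes later candidates."""
    heap, border = [], 0
    for itemset, utility in candidates:
        if len(heap) < k or utility > border:
            heapq.heappush(heap, (utility, itemset))
            if len(heap) > k:
                heapq.heappop(heap)       # discard the weakest of the k+1
            border = heap[0][0]           # the raised threshold
    return sorted(heap, reverse=True)

cands = [("ab", 30), ("bc", 55), ("abc", 20), ("cd", 70), ("de", 40)]
print(top_k(cands, k=2))   # [(70, 'cd'), (55, 'bc')]
```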
70. Mining User-Aware Rare Sequential Topic Patterns in Document Streams
ABSTRACT
Textual documents created and distributed on the Internet are ever-changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while the sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviours of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, and so can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining problem through three phases: pre-processing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by performing user-aware rarity analysis on the derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover special users and interpretable URSTPs effectively and efficiently, which significantly reflect users’ characteristics.
71. Pattern Based Sequence Classification
ABSTRACT
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule-based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern-feature-based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well and prove effective for the classification task.
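A toy version of the cohesion-times-support interestingness measure (the cohesion definition here, pattern length over the mean length of the shortest window containing all pattern items, is a simplified stand-in for the paper's):

```python
def min_window(seq, items):
    """Length of the shortest contiguous window of seq containing all items."""
    best = len(seq)
    for i in range(len(seq)):
        seen = set()
        for j in range(i, len(seq)):
            if seq[j] in items:
                seen.add(seq[j])
            if seen == items:
                best = min(best, j - i + 1)
                break
    return best

def interestingness(pattern, sequences):
    covers = [s for s in sequences if set(pattern) <= set(s)]
    if not covers:
        return 0.0
    support = len(covers) / len(sequences)
    spans = [min_window(s, set(pattern)) for s in covers]
    cohesion = len(pattern) / (sum(spans) / len(spans))   # 1.0 = items always adjacent
    return support * cohesion

print(interestingness(("a", "b"), [list("axb"), list("ab"), list("xyz")]))  # ~0.533
```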
72. ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
ABSTRACT
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns manifest in only a small (salient) subset of the very high dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns they exhibit, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic dataset and two real-world text corpora, and achieves better performance compared to both standard group AD and individual AD techniques.
73. Crowdsourced Data Management: A Survey
ABSTRACT
Some important data management and analytics tasks cannot be completely addressed by automated processes. These “computer-hard” tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Human computation is an effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd). Thus, crowdsourced data management has become an area of increasing interest in research and industry. There are three important problems in crowdsourced data management. (1) Quality control: workers may return noisy results, and effective techniques are required to achieve high quality. (2) Cost control: the crowd is not free, and cost control aims to reduce the monetary cost. (3) Latency control: human workers can be slow, particularly in contrast to computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors for designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans consisting of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis we then outline key factors that need to be considered to improve crowdsourced data management.
74. A Survey of General-Purpose Crowdsourcing Techniques
ABSTRACT
Since Jeff Howe introduced the term crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of Computer Science. Even though a lot of work has been conducted on this topic, so far we do not have a comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts: incentive design, task assignment and quality control. For each part, we start with the different problems faced in that area, followed by a brief description of existing work and a discussion of pros and cons. In addition, we present a real scenario of how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of current general-purpose crowdsourcing techniques and present some open problems in this area.
75. TopicSketch: Real-time Bursty Topic Detection from Twitter
ABSTRACT
Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real time. As existing methods can hardly scale to handle the task with the tweet stream in real time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, while presenting bursty events at a finer granularity.
76. SPIRIT: A Tree Kernel-based Method for Topic Person Interaction Detection
ABSTRACT
The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents helps readers better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design a rich interactive tree structure to represent the syntactic, contextual, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments. Experimental results based on real-world topics demonstrate that the proposed rich interactive tree structure effectively detects topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.
77. Truth Discovery in Crowdsourced Detection of Spatial Events
ABSTRACT
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events as smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., discovering true events from diverse and noisy participants’ reports. This problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants’ mobility and reliability. Decoupling these two types of uncertainties through location tracking would raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports would significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events (PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths of events, and three-way participant reliability in a unified framework. In PTSE, we further model personal location visit tendencies. These proposed models are capable of effectively handling various types of uncertainties and automatically discovering truths without any supervision or location tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.
78. Graph Regularized Feature Selection with Data Reconstruction
ABSTRACT
Feature selection is a challenging problem in high dimensional data processing, which arises in many real applications such as data mining, information retrieval, and pattern recognition. In this paper, we study the problem of unsupervised feature selection. The problem is challenging due to the lack of label information to guide feature selection. We formulate the problem of unsupervised feature selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the selected features not only preserve the local structure of the original data space via graph regularization, but also approximately reconstruct each data point via linear combination. Therefore, the graph regularized data reconstruction error becomes a natural criterion for measuring the quality of the selected features. By minimizing the reconstruction error, we are able to select the features that best preserve both the similarity and discriminant information in the original data. We then develop an efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the performance of our proposed algorithm on text clustering. Extensive experiments demonstrate the effectiveness of our proposed approach.
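A schematic form of such an objective, under assumed notation (the paper's exact formulation may differ): with data matrix $X$, a selected feature subset $\mathcal{S}$, reconstruction weights $W$, and graph Laplacian $L$ built from the data similarity graph,

$$
\min_{\mathcal{S},\,W}\; \bigl\lVert X - X_{\mathcal{S}} W \bigr\rVert_F^2 \;+\; \lambda\,\mathrm{Tr}\!\bigl[(X_{\mathcal{S}} W)^{\top} L\,(X_{\mathcal{S}} W)\bigr],
$$

where the first term is the error of reconstructing the data from the selected features $X_{\mathcal{S}}$, the second term penalizes reconstructions that break the local neighborhood structure encoded in $L$, and $\lambda$ trades the two off.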
79. Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks
ABSTRACT
The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical, users among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing, in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification solutions, which mainly focus on text mining of users’ public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual, and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered identical users. We also develop two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network-structure-based algorithms.
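The core FRUI iteration can be sketched as follows: score each candidate cross-network pair by how many of its friends are already matched (the match degree), greedily accept the top-ranked pair, and repeat. The data structures and seed set are illustrative, not the paper's exact algorithm:

```python
def frui(friends_a, friends_b, seeds):
    """friends_a / friends_b: {user: set_of_friends} in networks A and B.
    seeds: initial known identical pairs [(ua, ub), ...]."""
    matched = dict(seeds)                       # A-user -> B-user
    while True:
        scores = {}
        taken_b = set(matched.values())
        for ua, fa in friends_a.items():
            if ua in matched:
                continue
            for ub, fb in friends_b.items():
                if ub in taken_b:
                    continue
                deg = sum(1 for f in fa if matched.get(f) in fb)  # match degree
                if deg > 0:
                    scores[(ua, ub)] = deg
        if not scores:
            return matched
        ua, ub = max(scores, key=scores.get)    # accept the top-ranked UMP
        matched[ua] = ub
```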
80. TaxoFinder: A Graph-Based Approach for Taxonomy Learning
ABSTRACT
Taxonomy learning is an important task for knowledge acquisition, sharing, and classification, as well as application development and utilization in various domains. To reduce the human effort of building a taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy. First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph representing how such concepts are associated, based on their co-occurrences. As the key method in TaxoFinder, we propose a method for measuring associative strengths among the concepts, which quantifies how strongly they are associated in the graph, using similarities between sentences and spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a graph analytic algorithm, aiming to maximize the overall associative strength among the concepts. We evaluate TaxoFinder using gold-standard evaluation on three different domains: emergency management for mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach significantly outperforming the subsumption method.
81. Clustering Data Streams Based on Shared Density between Micro-Clusters
ABSTRACT
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real-time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
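The shared density graph can be pictured with a small C# sketch (a simplification; DBSTREAM also decays weights over time and manages micro-cluster creation): each time a point falls in the overlap of two micro-clusters, their shared count is incremented, and reclustering connects micro-clusters whose shared density is high enough.

using System.Collections.Generic;
using System.Linq;

class SharedDensityGraph
{
    private readonly Dictionary<(int, int), double> shared = new();

    // Called when an incoming point lies in the overlapping area of mc1 and mc2.
    public void Observe(int mc1, int mc2)
    {
        var key = mc1 < mc2 ? (mc1, mc2) : (mc2, mc1);
        shared[key] = shared.GetValueOrDefault(key) + 1;
    }

    // Offline step: micro-cluster pairs dense enough to merge into one final cluster.
    public IEnumerable<(int, int)> Edges(double minSharedDensity) =>
        shared.Where(kv => kv.Value >= minSharedDensity).Select(kv => kv.Key);
}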
82. NATERGM: A Model for Examining the Role of Nodal Attributes in Dynamic Social Media Networks
ABSTRACT
Social media networks are dynamic. As such, the order in which network ties develop is an important aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network analysis. The proposed model focuses on how the nodal attributes of a network affect the order in which the network ties develop. Temporal patterns in social media networks are modelled based on the nodal attributes of individuals and the time information of network ties. Using social media data collected from a knowledge sharing community, empirical tests were conducted to evaluate the performance of the NATERGM on identifying the temporal patterns and predicting the characteristics of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing capability and an increased prediction accuracy of network characteristics compared to benchmark models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation process of dynamic networks.
83. Quality-Aware Subgraph Matching over Inconsistent Probabilistic Graph Databases
ABSTRACT
Resource Description Framework (RDF) has been widely used in the Semantic Web to describe resources and their relationships. The RDF graph is one of the most commonly used representations for RDF data. However, in many real applications such as data extraction/integration, RDF graphs integrated from different data sources may often contain uncertain and inconsistent information (e.g., uncertain labels, or labels that violate facts/rules), due to the unreliability of data sources. In this paper, we formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and have high quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an effective index to facilitate our proposed pruning methods, and propose an efficient approach for processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments.
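As an illustration of quality-score pruning only (the adaptive label pruning and the actual score definition in the paper are more involved), a C# sketch can discard candidate subgraphs whose quality upper bound cannot reach the threshold; the product combination of probability and consistency below is an assumption.

using System.Collections.Generic;
using System.Linq;

record Candidate(int Id, double Probability, double Consistency);

static class QaGMatchPruning
{
    // Upper bound on the quality score; pruning is safe as long as
    // this never underestimates the true score of a candidate.
    static double UpperBound(Candidate c) => c.Probability * c.Consistency;

    public static IEnumerable<Candidate> Prune(
        IEnumerable<Candidate> candidates, double threshold) =>
        candidates.Where(c => UpperBound(c) >= threshold);
}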
84. Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification
ABSTRACT
Graph classification aims to learn models to classify structured data. To date, all existing graph classification methods are designed to target one single learning task and require a large number of labelled samples for learning good classification models. In reality, each real-world task may only have a limited number of labelled samples, yet multiple similar learning tasks can provide useful knowledge to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG) problem, where multiple graph classification tasks are jointly regularized to find discriminative subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that, with a limited number of training samples, subgraph features selected for one single graph classification task tend to overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize multiple tasks to explore high-quality subgraph features for graph classification. To achieve this goal, we formulate an objective function which combines multiple graph classification tasks to evaluate the informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task learning process is further proposed to incrementally select subgraph features for graph classification. Experiments on real-world multi-task graph classification datasets demonstrate significant performance gain.
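A hedged C# sketch of a multi-task informativeness score follows (the paper's objective function differs in detail): a subgraph feature is scored per task by how well its presence separates graph labels, and the per-task scores are summed so that a feature must help across tasks rather than overfit one.

using System;

static class MtgScoreSketch
{
    // featurePresent[t][i]: the subgraph occurs in graph i of task t.
    // labels[t][i]: the +1/-1 label of graph i in task t.
    public static double MultiTaskScore(bool[][] featurePresent, int[][] labels)
    {
        double total = 0;
        for (int t = 0; t < labels.Length; t++)
        {
            double cov = 0;
            for (int i = 0; i < labels[t].Length; i++)
                cov += labels[t][i] * (featurePresent[t][i] ? 1 : -1);
            total += Math.Abs(cov) / labels[t].Length; // discriminative power in task t
        }
        return total;
    }
}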
85. Mining Health Examination Records — A Graph-based Approach
ABSTRACT
General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. In particular, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very ill, and there is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk prediction, which classifies a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and a proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
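The iterative core can be pictured with the classic graph-based semi-supervised propagation update, shown below in C# as a hedged stand-in (SHG-Health operates on a heterogeneous graph and its update differs): class scores spread along edges while labeled nodes keep pulling toward their known labels.

using System.Linq;

static class LabelPropagationSketch
{
    // S: n x n normalized similarity matrix; Y: n x c initial label matrix
    // (one row per participant, one column per risk class, zeros if unlabeled).
    public static double[][] Propagate(double[][] S, double[][] Y, double alpha, int iters)
    {
        int n = Y.Length, c = Y[0].Length;
        var F = Y.Select(row => row.ToArray()).ToArray();
        for (int it = 0; it < iters; it++)
        {
            var next = new double[n][];
            for (int i = 0; i < n; i++)
            {
                next[i] = new double[c];
                for (int k = 0; k < c; k++)
                {
                    double s = 0;
                    for (int j = 0; j < n; j++) s += S[i][j] * F[j][k];
                    // neighbors' scores, anchored by the initial labels
                    next[i][k] = alpha * s + (1 - alpha) * Y[i][k];
                }
            }
            F = next;
        }
        return F;
    }
}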
86. Semantic-Aware Blocking for Entity Resolution
ABSTRACT
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently unifies both textual and semantic features into an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. In doing so, the proposed framework can support efficient similarity searches on records in both textual and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking techniques. Our experimental study shows that the combination of semantic similarity and textual similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking for performing entity resolution tasks in a large-scale data environment.
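One common way to realize LSH-based blocking is MinHash banding, sketched below in C# as an assumption about the general approach rather than the paper's exact construction: records whose signatures agree on any band share a block key and become comparison candidates.

using System;
using System.Collections.Generic;
using System.Linq;

static class LshBlocker
{
    // records: each record as a set of token ids (assumed non-empty).
    public static Dictionary<string, List<int>> MinHashBlocks(
        List<HashSet<int>> records, int numHashes = 32, int rowsPerBand = 4)
    {
        var rnd = new Random(7);
        var coef = Enumerable.Range(0, numHashes)
            .Select(_ => (a: (long)rnd.Next(1, int.MaxValue), b: (long)rnd.Next()))
            .ToArray();
        const long P = 4294967311;   // prime larger than any 32-bit token id

        var blocks = new Dictionary<string, List<int>>();
        for (int r = 0; r < records.Count; r++)
        {
            // MinHash signature: minimum of each hash function over the token set.
            var sig = coef.Select(c => records[r].Min(x => (c.a * x + c.b) % P)).ToArray();
            for (int band = 0; band * rowsPerBand < numHashes; band++)
            {
                string key = band + ":" + string.Join(",", sig.Skip(band * rowsPerBand).Take(rowsPerBand));
                if (!blocks.TryGetValue(key, out var list)) blocks[key] = list = new List<int>();
                list.Add(r);   // records sharing a band key end up in one block
            }
        }
        return blocks;
    }
}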
87. On Learning of Choice Models with Interactive Attributes
ABSTRACT
Introducing recent advances in machine learning to state-of-the-art discrete choice models, we develop an approach to infer the unique and complex decision-making process of a decision-maker (DM), which is characterized by the DM’s priorities and attitudinal character, along with the interaction among attributes, to name a few. On the basis of exemplary preference information in the form of pairwise comparisons of alternatives, our method seeks to induce a DM’s preference model in terms of the parameters of recent discrete choice models. To this end, we reduce our learning function to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into consideration the interaction among the attributes along with the priorities and the unique attitudinal character of a DM. The experimental results on standard benchmark datasets suggest that our approach is not only intuitively appealing and easily interpretable but also competitive with state-of-the-art methods.
88. OMASS: One Memory Access Set Separation
ABSTRACT
In many applications, there is a need to identify to which of a group of sets an element x belongs, if any. For example, in a router, this functionality can be used to determine the next hop of an incoming packet. This problem is generally known as set separation and has been widely studied. Most existing solutions make use of hash-based algorithms, particularly when a small percentage of false positives is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can require several memory accesses, a significant limitation for some implementations. We propose an approach using Block Bloom Filters, where each element is first hashed to a single memory block that stores a small Bloom filter that tracks the element and the set or sets the element belongs to. In a naïve solution, when an element x in a set S is stored, it necessarily increases the false positive probability for finding that x is in another set T. In this paper, we introduce our One Memory Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates for the Bloom filters for element x under other sets are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution.
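The positional guarantee is easy to sketch in C#; the version below simplifies OMASS by giving each set a disjoint bit field inside a 64-bit word (the paper's mapping is more bit-efficient, so treat this layout as an assumption). One word read answers a membership query for any set.

using System;

class OmassSketch
{
    private readonly ulong[] words;   // one machine word per block
    private readonly int fieldWidth;  // private bit range for each set within a word
    private readonly int k;           // hash functions per set

    // numSets is assumed to divide 64; set ids are in [0, numSets).
    public OmassSketch(int numBlocks, int numSets, int k)
    {
        words = new ulong[numBlocks];
        fieldWidth = 64 / numSets;
        this.k = k;
    }

    private int Block(int x) => (int)((uint)HashCode.Combine(x) % (uint)words.Length);

    // Bits for (x, set) fall only inside that set's field, so storing x in one
    // set can never raise the false-positive rate of x under another set.
    private int Bit(int x, int set, int i) =>
        set * fieldWidth + (int)((uint)HashCode.Combine(x, set, i) % (uint)fieldWidth);

    public void Add(int x, int set)
    {
        int b = Block(x);
        for (int i = 0; i < k; i++) words[b] |= 1UL << Bit(x, set, i);
    }

    public bool MayContain(int x, int set)
    {
        ulong w = words[Block(x)];   // the single memory access
        for (int i = 0; i < k; i++)
            if ((w & (1UL << Bit(x, set, i))) == 0) return false;
        return true;   // possibly a false positive, as in any Bloom filter
    }
}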
89. Resolving Multi-party Privacy Conflicts in Social Media
ABSTRACT
Items shared through Social Media may affect more than one user’s privacy, e.g., photos that depict multiple users, comments that mention multiple users, or events to which multiple users are invited. The lack of multi-party privacy management support in current mainstream Social Media infrastructures leaves users unable to appropriately control with whom these items are actually shared. Computational mechanisms that are able to merge the privacy preferences of multiple users into a single policy for an item can help solve this problem. However, merging multiple users’ privacy preferences is not an easy task, because privacy preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict in order to propose solutions that can be acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only consider fixed ways of aggregating privacy preferences. In this paper, we propose the first computational mechanism to resolve conflicts for multi-party privacy management in Social Media that is able to adapt to different situations by modelling the concessions that users make to reach a solution to the conflicts. We also present results of a user study in which our proposed mechanism outperformed other existing approaches in terms of how many times each approach matched users’ behaviour.
90. SEDEX: Scalable Entity Preserving Data Exchange
ABSTRACT
Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that source data is reflected in the target. Generally, data exchange is performed using schema mappings, which represent high-level relations between source and target schemas. In this paper, we argue that data exchange based solely on schema-level information limits the ability to express semantics in data exchange. We show that such schema-level mappings not only may result in entity fragmentation but are also unable to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show that SEDEX outperforms other methods in terms of quality and scalability of data exchange.
91. DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud
ABSTRACT
Despite recent advances in distributed RDF data management, processing large amounts of RDF data in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes rich and complex graphs mixing both instance and schema-level data. Sharding such data using classical techniques or partitioning the graph using traditional min-cut algorithms leads to very inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud, an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous approaches, DiploCloud runs a physiological analysis of both instance and schema information prior to partitioning the data. We describe the architecture of DiploCloud, its main data structures, and the new algorithms we use to partition and distribute data. We also present an extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster than state-of-the-art systems on standard workloads.
92. A Survey on Trajectory Data Mining: Techniques and Applications
ABSTRACT
The rapid advance of location-acquisition technologies has boosted the generation of trajectory data, which track the traces of moving objects. A trajectory is typically represented by a sequence of timestamped geographical locations. A wide spectrum of applications can benefit from trajectory data mining. While bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this paper, we survey various applications of trajectory data mining, e.g., path discovery, location prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive collection of existing trajectory data mining techniques and discusses them within a framework of trajectory data mining. This framework and the survey can be used as a guideline for designing future trajectory data mining solutions.
93. Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems
ABSTRACT
In this paper, we consider a new insider threat for the privacy-preserving work of distributed kernel-based data mining (DKBDM), such as distributed support vector machines. Among several known data-breaching problems, those associated with insider attacks have been rising significantly, making this one of the fastest growing types of security breaches. Once considered a negligible concern, insider attacks have risen to be one of the top three central data violations. Insider-related research involving the distribution of kernel-based data mining is limited, resulting in substantial vulnerabilities in designing protection against collaborative organizations. Prior works often fall short by addressing a multifactorial model that is more limited in scope and implementation than addressing insiders within an organization colluding with outsiders. A faulty system allows collusion to go unnoticed when an insider shares data with an outsider, who can then recover the original data from message transmissions (intermediary kernel values) among organizations. This attack requires only accessibility to a few data entries within the organizations, rather than the encrypted administrative privileges typically found in distributed data mining scenarios. To the best of our knowledge, we are the first to explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of insider data necessary to launch the insider attack. Finally, we follow up by introducing several proposed privacy-preserving schemes to counter the described attack.
94. Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
ABSTRACT
Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining of frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithms achieve speedups of up to P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is very universal: it can be used for static load-balancing of other pattern mining algorithms, such as itemset/tree/graph mining algorithms.
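Once per-task costs have been estimated probabilistically, the static assignment itself can be as simple as the following C# sketch (a standard longest-processing-time-first heuristic, used here as a hedged stand-in for the paper's scheme): each mining subproblem goes to the currently least-loaded processor.

using System;
using System.Collections.Generic;
using System.Linq;

static class StaticLoadBalancer
{
    // estimatedCost[t]: probabilistically estimated mining time of subproblem t.
    public static List<int>[] Assign(double[] estimatedCost, int processors)
    {
        var bins = Enumerable.Range(0, processors).Select(_ => new List<int>()).ToArray();
        var load = new double[processors];
        foreach (int t in Enumerable.Range(0, estimatedCost.Length)
                                    .OrderByDescending(t => estimatedCost[t]))
        {
            int p = Array.IndexOf(load, load.Min());   // least-loaded processor
            bins[p].Add(t);
            load[p] += estimatedCost[t];
        }
        return bins;
    }
}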
95. FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
ABSTRACT
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets, and the reducers perform combination operations by constructing small ultrametric trees, then mine these trees separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets with different lengths have different decomposition and construction costs. To improve FiDoop’s performance, we develop a workload balance metric to measure load balance across the cluster’s computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
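The shape of that third job can be simulated with LINQ, as in the hedged C# sketch below: the map phase decomposes each frequent itemset around a pivot item, and the shuffle groups decompositions by pivot so each reducer gets one small, independently minable workload. Building and mining the ultrametric trees themselves is elided here.

using System.Collections.Generic;
using System.Linq;

static class FiDoopJobShape
{
    // itemsets: frequent itemsets produced by the earlier MapReduce jobs.
    public static ILookup<string, string[]> MapAndShuffle(IEnumerable<string[]> itemsets) =>
        itemsets
            .SelectMany(its => its.Select(pivot => (pivot, its)))   // map: decompose
            .ToLookup(p => p.pivot, p => p.its);                    // shuffle: group by pivot
}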
96. Fast and Accurate Mining of the Community Structure: Integrating Center Locating and Membership Optimization
ABSTRACT
Mining communities or clusters in networks is valuable for analyzing, designing, and optimizing many natural and engineering complex systems, e.g., protein networks, power grids, and transportation systems. Most of the existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded in a systematic theory for identifying the central nodes in the network. Moreover, how to reconcile mining efficiency and community quality remains an open problem. In this paper, we attempt to address these challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node, and the nodes with the highest local leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamical assignment of community membership, and formulate the conditions that guarantee the convergence of each node’s dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it could also be applied in other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linearly dependent on the number of nodes in sparse networks. Finally, we give demonstrative applications of the algorithm to a set of synthetic benchmark networks as well as real-world networks to verify the algorithmic performance.
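The center-locating step can be sketched in C# as follows (a hedged reading: the Gaussian kernel and the local-maximum rule are assumptions consistent with, but not copied from, the paper). Each node's leadership aggregates kernel-weighted influence from its neighbors, and nodes that dominate their neighborhoods become candidate centers.

using System;
using System.Collections.Generic;
using System.Linq;

static class CenterLocator
{
    // adj[i]: neighbors of node i; dist[i, j]: distance between i and j;
    // sigma: the tunable influence factor of the kernel.
    public static List<int> CandidateCenters(List<int>[] adj, double[,] dist, double sigma)
    {
        int n = adj.Length;
        var leadership = new double[n];
        for (int i = 0; i < n; i++)
            foreach (int j in adj[i])
                leadership[i] += Math.Exp(-dist[i, j] * dist[i, j] / (2 * sigma * sigma));

        var centers = new List<int>();
        for (int i = 0; i < n; i++)
            if (adj[i].All(j => leadership[i] >= leadership[j]))   // local leadership maximum
                centers.Add(i);
        return centers;
    }
}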