JAVA-IEEE PROJECTS
1.SecRBAC: Secure data in the Clouds
ABSTRACT
Most current security solutions are based on perimeter security. However, Cloud computing breaks the organization perimeters. When data resides in the Cloud, they reside outside the organizational bounds. This leads users to a loss of control over their data and raises reasonable security concerns that slow down the adoption of Cloud computing. Is the Cloud service provider accessing the data? Is it legitimately applying the access control policy defined by the user? This paper presents a data-centric access control solution with enriched role-based expressiveness in which security is focused on protecting user data regardless of the Cloud service provider that holds it. Novel identity-based and proxy re-encryption techniques are used to protect the authorization model. Data is encrypted and authorization rules are cryptographically protected to preserve user data against access or misbehavior by the service provider. The authorization model provides high expressiveness with role hierarchy and resource hierarchy support. The solution takes advantage of the logic formalism provided by Semantic Web technologies, which enables advanced rule management like semantic conflict detection. A proof of concept implementation has been developed and a working prototypical deployment of the proposal has been integrated within Google services.
2.Trust Agent-Based Behavior Induction in Social Networks
ABSTRACT
The essence of social networks is that they can influence people's public opinions and allow group behaviors to form quickly. Negative group behavior influences societal stability significantly, but existing behavior-induction approaches are too simple and inefficient. To induce behavior in social networks automatically and efficiently, this article introduces trust agents and designs their features according to group behavior features. In addition, a dynamics control mechanism can be generated to coordinate participant behaviors in social networks and avoid a specific restricted negative group behavior.
3.A Shoulder Surfing Resistant Graphical Authentication System
ABSTRACT
Authentication based on passwords is widely used in applications for computer security and privacy. However, human actions such as choosing bad passwords and inputting passwords in an insecure way are regarded as "the weakest link" in the authentication chain. Rather than arbitrary alphanumeric strings, users tend to choose passwords that are either short or meaningful for easy memorization. With web applications and mobile apps piling up, people can access these applications anytime and anywhere with various devices. This evolution brings great convenience but also increases the probability of exposing passwords to shoulder surfing attacks. Attackers can observe directly or use external recording devices to collect users' credentials. To overcome this problem, we propose a novel authentication system, PassMatrix, based on graphical passwords to resist shoulder surfing attacks. With a one-time valid login indicator and circulative horizontal and vertical bars covering the entire scope of pass-images, PassMatrix offers no hint for attackers to figure out or narrow down the password, even if they conduct multiple camera-based attacks. We also implemented a PassMatrix prototype on Android and carried out real user experiments to evaluate its memorability and usability. The experimental results show that the proposed system achieves better resistance to shoulder surfing attacks while maintaining usability.
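For illustration, here is a minimal Java sketch of the coordinate check such a scheme rests on (class and field names are ours; PassMatrix's actual protocol involves pass-image sequences and obfuscated touch input): the secret is one grid cell per pass-image, and a fresh random indicator shifts the reported cell on every login, so an observer who sees the touched position learns nothing without the indicator.

    import java.security.SecureRandom;

    // Hypothetical sketch of a PassMatrix-style check: the secret is one (row, col)
    // cell per pass-image; the server issues a fresh random indicator per login and
    // the client reports the secret cell offset by that indicator.
    public class PassMatrixSketch {
        static final int ROWS = 7, COLS = 11;        // grid laid over each pass-image
        static final SecureRandom RNG = new SecureRandom();

        // One-time login indicator: a random shift valid for this login only.
        static int[] freshIndicator() {
            return new int[] { RNG.nextInt(ROWS), RNG.nextInt(COLS) };
        }

        // Client side: answer = secret cell shifted by the indicator (mod grid size).
        static int[] answer(int[] secret, int[] ind) {
            return new int[] { (secret[0] + ind[0]) % ROWS, (secret[1] + ind[1]) % COLS };
        }

        // Server side: undo the shift and compare with the stored secret cell.
        static boolean verify(int[] stored, int[] ans, int[] ind) {
            int r = Math.floorMod(ans[0] - ind[0], ROWS);
            int c = Math.floorMod(ans[1] - ind[1], COLS);
            return r == stored[0] && c == stored[1];
        }

        public static void main(String[] args) {
            int[] secret = { 3, 5 };                 // enrolled cell for one pass-image
            int[] ind = freshIndicator();
            System.out.println(verify(secret, answer(secret, ind), ind)); // true
        }
    }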
4.A Locality Sensitive Low-Rank Model for Image Tag Completion
ABSTRACT
Many visual applications have benefited from the outburst of web images, yet the imprecise and incomplete tags arbitrarily provided by users, as the thorn of the rose, may hamper the performance of retrieval or indexing systems relying on such data. In this paper, we propose a novel locality sensitive low-rank model for image tag completion, which approximates the global nonlinear model with a collection of local linear models. To effectively infuse the idea of locality sensitivity, a simple and effective pre-processing module is designed to learn suitable representations for data partition, and a global consensus regularizer is introduced to mitigate the risk of overfitting. Meanwhile, low-rank matrix factorization is employed for the local models, where the local geometry structures are preserved in the low-dimensional representations of both tags and samples. Extensive empirical evaluations conducted on three datasets demonstrate the effectiveness and efficiency of the proposed method, which outperforms previous ones by a large margin.
5.Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases
ABSTRACT
Resource Description Framework (RDF) has been widely used in the Semantic Web to describe resources and their relationships. The RDF graph is one of the most commonly used representations for RDF data. However, in many real applications such as data extraction/integration, RDF graphs integrated from different data sources may often contain uncertain and inconsistent information (e.g., uncertain labels, or labels that violate facts/rules), due to the unreliability of data sources. In this paper, we formalize such RDF data as inconsistent probabilistic RDF graphs, which contain both inconsistencies and uncertainty. With this probabilistic graph model, we focus on an important problem, quality-aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and have high quality scores (considering both consistency and uncertainty). In order to efficiently answer QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an effective index to facilitate our proposed pruning methods, and propose an efficient approach for processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our proposed approaches through extensive experiments.
6.Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
ABSTRACT
With advances in geo-positioning technologies and geo-location services, there is a rapidly growing amount of spatio-textual objects collected in many applications such as location-based services and social networks, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which explores both the location and the textual description of objects, has attracted great attention from commercial organizations and research communities. In this paper, we study two fundamental problems in spatial keyword queries: top-k spatial keyword search (TOPK-SK) and batch top-k spatial keyword search (BTOPK-SK). Given a set of spatio-textual objects, a query location and a set of query keywords, TOPK-SK retrieves the closest k objects, each of which contains all keywords in the query. BTOPK-SK is the batch processing of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel index structure, called the inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit both spatial and keyword-based pruning techniques to effectively reduce the search space. An efficient algorithm is then developed to tackle top-k spatial keyword search. To further enhance the filtering capability of the signature of the linear quadtree, we propose a partition-based method. In addition, to deal with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based on both spatial proximity and textual relevance between queries. We show that the IL-Quadtree technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic data clearly demonstrate the efficiency of our methods.
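As background on the linear quadtree component, the sketch below (ours, not the authors' code) shows the standard Morton (Z-order) encoding that lets quadtree cells be stored and compared as plain integer keys, to which structures such as inverted lists and signatures can then be attached.

    // Illustrative helper: a linear quadtree addresses each cell by its Morton
    // (Z-order) code, i.e., the bit-interleaving of its x and y coordinates.
    public class MortonCode {
        // Spread the low 16 bits of v so there is a zero bit between consecutive bits.
        static long part1By1(long v) {
            v &= 0xFFFFL;
            v = (v | (v << 8)) & 0x00FF00FFL;
            v = (v | (v << 4)) & 0x0F0F0F0FL;
            v = (v | (v << 2)) & 0x33333333L;
            v = (v | (v << 1)) & 0x55555555L;
            return v;
        }

        // Morton code of grid cell (x, y): y-bits in odd positions, x-bits in even.
        static long encode(int x, int y) {
            return (part1By1(y) << 1) | part1By1(x);
        }

        public static void main(String[] args) {
            System.out.println(encode(3, 5)); // cells close in space get nearby codes
        }
    }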
7.Practical Approximate k Nearest Neighbor Queries with Location and Query Privacy
ABSTRACT
In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study approximate k nearest neighbor (kNN) queries where the mobile user queries the location-based service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current location. We propose a basic solution and a generic solution for the mobile user to preserve his location and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier public-key cryptosystem and can provide both location and query privacy. To preserve query privacy, our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic solution can be applied to multiple discrete type attributes of private location-based queries. Compared with existing solutions for kNN queries with location privacy, our solution is more efficient. Experiments have shown that our solution is practical for kNN queries.
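Since the solutions are built on the Paillier public-key cryptosystem, the following toy Java implementation (textbook Paillier with simplified parameters; an illustration only, not the paper's protocol) shows the additive homomorphism they rely on: multiplying two ciphertexts yields an encryption of the sum of the plaintexts.

    import java.math.BigInteger;
    import java.security.SecureRandom;

    // Textbook Paillier: Enc(m) = g^m * r^n mod n^2 with g = n + 1.
    public class PaillierDemo {
        static final SecureRandom RNG = new SecureRandom();
        static BigInteger n, n2, lambda, mu;

        static void keyGen() {
            BigInteger p = BigInteger.probablePrime(512, RNG);
            BigInteger q = BigInteger.probablePrime(512, RNG);
            n = p.multiply(q);
            n2 = n.multiply(n);
            lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
            mu = lambda.modInverse(n); // valid because g = n + 1
        }

        static BigInteger enc(BigInteger m) {
            BigInteger r;
            do { r = new BigInteger(n.bitLength(), RNG).mod(n); } while (r.signum() == 0);
            BigInteger gm = n.add(BigInteger.ONE).modPow(m, n2);
            return gm.multiply(r.modPow(n, n2)).mod(n2);
        }

        static BigInteger dec(BigInteger c) {
            // L(x) = (x - 1) / n, applied to c^lambda mod n^2
            BigInteger u = c.modPow(lambda, n2).subtract(BigInteger.ONE).divide(n);
            return u.multiply(mu).mod(n);
        }

        public static void main(String[] args) {
            keyGen();
            BigInteger c = enc(BigInteger.valueOf(41))
                .multiply(enc(BigInteger.valueOf(1))).mod(n2);
            System.out.println(dec(c)); // 42: Enc(41) * Enc(1) decrypts to 41 + 1
        }
    }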
8.Privacy Protection for Wireless Medical Sensor Data
ABSTRACT
In recent years, wireless sensor networks have been widely used in healthcare applications, such as hospital and home patient monitoring. Wireless medical sensor networks are more vulnerable to eavesdropping, modification, impersonation and replaying attacks than wired networks. A lot of work has been done to secure wireless medical sensor networks. The existing solutions can protect the patient data during transmission, but cannot stop the inside attack where the administrator of the patient database reveals the sensitive patient data. In this paper, we propose a practical approach to prevent the inside attack by using multiple data servers to store patient data. The main contribution of this paper is securely distributing the patient data in multiple data servers and employing the Paillier and ElGamal cryptosystems to perform statistical analysis on the patient data without compromising the patients' privacy.
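The core idea of splitting each reading across multiple servers, so that no single database administrator ever sees it, can be illustrated with additive secret sharing; the sketch below is our simplification and does not reproduce the paper's Paillier/ElGamal statistics machinery.

    import java.security.SecureRandom;

    // Simplified illustration: split each patient reading into three additive
    // shares, one per data server, modulo M. No single server learns the reading,
    // yet each server can sum its own shares locally so that only the aggregate
    // statistic is ever reconstructed.
    public class SecretSharedStats {
        static final long M = 1_000_003L;            // public modulus > any reading
        static final SecureRandom RNG = new SecureRandom();

        static long[] share(long value) {
            long s0 = Math.floorMod(RNG.nextLong(), M);
            long s1 = Math.floorMod(RNG.nextLong(), M);
            long s2 = Math.floorMod(value - s0 - s1, M);
            return new long[] { s0, s1, s2 };        // one share per server
        }

        public static void main(String[] args) {
            long[] readings = { 72, 88, 64 };        // e.g., heart rates
            long[] serverSums = new long[3];         // each server sums its own shares
            for (long r : readings) {
                long[] sh = share(r);
                for (int i = 0; i < 3; i++)
                    serverSums[i] = Math.floorMod(serverSums[i] + sh[i], M);
            }
            long total = Math.floorMod(serverSums[0] + serverSums[1] + serverSums[2], M);
            System.out.println(total);               // 224 = 72 + 88 + 64
        }
    }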
9.Enabling Fine-Grained Multi-Keyword Search Supporting Classified Sub-Dictionaries over Encrypted Cloud Data
ABSTRACT
Using cloud computing, individuals can store their data on remote servers and allow data access to public users through the cloud servers. As the outsourced data are likely to contain sensitive privacy information, they are typically encrypted before being uploaded to the cloud. This, however, significantly limits the usability of outsourced data due to the difficulty of searching over the encrypted data. In this paper, we address this issue by developing fine-grained multi-keyword search schemes over encrypted cloud data. Our original contributions are three-fold. First, we introduce relevance scores and preference factors upon keywords, which enable the precise keyword search and a personalized user experience. Second, we develop a practical and very efficient multi-keyword search scheme. The proposed scheme can support complicated logic search, i.e., mixed "AND", "OR" and "NO" operations of keywords. Third, we further employ the classified sub-dictionaries technique to achieve better efficiency in index building, trapdoor generation and query. Lastly, we analyze the security of the proposed schemes in terms of confidentiality of documents, privacy protection of index and trapdoor, and unlinkability of trapdoor. Through extensive experiments using a real-world dataset, we validate the performance of the proposed schemes. Both the security analysis and experimental results demonstrate that the proposed schemes can achieve the same security level as existing ones with better performance in terms of functionality, query complexity and efficiency.
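As a plaintext illustration of the search semantics (ours; the actual scheme evaluates this logic over encrypted indexes and trapdoors), the sketch below shows mixed AND/OR/NO keyword matching with preference-weighted ranking.

    import java.util.*;

    // Plaintext sketch of the supported query semantics: AND-keywords must all
    // appear, at least one OR-keyword must appear, NO-keywords must be absent;
    // surviving documents are ranked by preference-weighted relevance.
    public class MixedLogicSearch {
        static double score(Set<String> doc, List<String> and, List<String> or,
                            List<String> no, Map<String, Double> pref) {
            for (String k : and) if (!doc.contains(k)) return -1;   // reject
            for (String k : no)  if (doc.contains(k))  return -1;   // reject
            if (!or.isEmpty() && Collections.disjoint(doc, or)) return -1;
            double s = 0;
            for (String k : and) s += pref.getOrDefault(k, 1.0);    // rank survivors
            for (String k : or)  if (doc.contains(k)) s += pref.getOrDefault(k, 1.0);
            return s;
        }

        public static void main(String[] args) {
            Set<String> doc = new HashSet<>(Arrays.asList("cloud", "search", "java"));
            Map<String, Double> pref = Map.of("cloud", 2.0, "search", 1.0);
            System.out.println(score(doc, List.of("cloud"), List.of("search", "index"),
                                     List.of("flash"), pref));      // 3.0
        }
    }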
10.Leveraging Data Deduplication to Improve the Performance of Primary Storage Systems in the Cloud
ABSTRACT
With the explosive growth in data volume, the I/O bottleneck has become an increasingly daunting challenge for big data analytics in the Cloud. Recent studies have shown that moderate to high data redundancy clearly exists in primary storage systems in the Cloud. Our experimental studies reveal that data redundancy exhibits a much higher level of intensity on the I/O path than on disks, due to relatively high temporal access locality associated with small I/O requests to redundant data. Moreover, directly applying data deduplication to primary storage systems in the Cloud will likely cause space contention in memory and data fragmentation on disks. Based on these observations, we propose a performance-oriented I/O deduplication scheme, called POD, rather than a capacity-oriented one, exemplified by iDedup, to improve the I/O performance of primary storage systems in the Cloud without sacrificing the capacity savings of the latter. POD takes a two-pronged approach to improving the performance of primary storage systems and minimizing the performance overhead of deduplication: a request-based selective deduplication technique, called Select-Dedupe, to alleviate data fragmentation, and an adaptive memory management scheme, called iCache, to ease the memory contention between bursty read traffic and bursty write traffic. We have implemented a prototype of POD as a module in the Linux operating system. The experiments conducted on our lightweight prototype implementation show that POD significantly outperforms iDedup in terms of I/O performance, by up to 87.9 percent with an average of 58.8 percent. Moreover, our evaluation results also show that POD achieves comparable or better capacity savings than iDedup.
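The mechanism underlying request-based deduplication can be sketched as follows (our simplification, not POD's code): index each written block by a content fingerprint, so a redundant write becomes a metadata update instead of a disk write.

    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    // Minimal dedup sketch: index each written block by a collision-resistant
    // content fingerprint; a write whose fingerprint is already indexed is
    // redirected to the existing block.
    public class DedupIndex {
        private final Map<String, Long> fpToBlock = new HashMap<>(); // fingerprint -> block address
        private long nextFreeBlock = 0;

        static String fingerprint(byte[] block) throws Exception {
            byte[] d = MessageDigest.getInstance("SHA-256").digest(block);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        }

        // Returns the block address the logical write is mapped to.
        long write(byte[] block) throws Exception {
            String fp = fingerprint(block);
            Long existing = fpToBlock.get(fp);
            if (existing != null) return existing;       // duplicate: no disk write
            fpToBlock.put(fp, nextFreeBlock);
            return nextFreeBlock++;                      // unique: allocate new block
        }
    }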
11. Providing Privacy-Aware Incentives in Mobile Sensing Systems
ABSTRACT
Mobile sensing relies on data contributed by users through their mobile devices (e.g., smartphones) to obtain useful information about people and their surroundings. However, users may not want to contribute due to lack of incentives and concerns about possible privacy leakage. To effectively promote user participation, both incentive and privacy issues should be addressed. Although incentive and privacy have been addressed separately in mobile sensing, it is still an open problem to address them simultaneously. In this paper, we propose two credit-based privacy-aware incentive schemes for mobile sensing systems, where the focus is on privacy protection instead of on the design of incentive mechanisms. Our schemes enable mobile users to earn credits by contributing data without leaking which data they have contributed, and ensure that malicious users cannot abuse the system to earn unlimited credits. Specifically, the first scheme considers scenarios where an online trusted third party (TTP) is available, and relies on the TTP to protect user privacy and prevent abuse attacks. The second scheme considers scenarios where no online TTP is available. It applies blind signatures, partially blind signatures, and a novel extended Merkle tree technique to protect user privacy and prevent abuse attacks. Security analysis and cost evaluations show that our schemes are secure and efficient.
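As background for the extended Merkle tree technique, the sketch below (ours) computes a standard Merkle root: one hash commits to many values, and membership of a single leaf can later be proven without revealing the others, which is the building block the TTP-free scheme extends.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.List;

    // Standard Merkle root computation over SHA-256.
    public class MerkleRoot {
        static byte[] sha256(byte[]... parts) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        }

        static byte[] root(List<byte[]> leaves) throws Exception {
            List<byte[]> level = new ArrayList<>();
            for (byte[] leaf : leaves) level.add(sha256(leaf));          // hash leaves
            while (level.size() > 1) {
                List<byte[]> next = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    byte[] left = level.get(i);
                    byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                    next.add(sha256(left, right));                        // hash pairs upward
                }
                level = next;
            }
            return level.get(0);
        }

        public static void main(String[] args) throws Exception {
            List<byte[]> data = List.of("r1".getBytes(StandardCharsets.UTF_8),
                                        "r2".getBytes(StandardCharsets.UTF_8));
            System.out.println(root(data).length); // 32-byte root commitment
        }
    }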
12. A Simple Message-Optimal Algorithm for Random Sampling from a Distributed Stream
ABSTRACT
We present a simple, message-optimal algorithm for maintaining a random sample from a large data stream whose input elements are distributed across multiple sites that communicate via a central coordinator. At any point in time, the set of elements held by the coordinator represents a uniform random sample from the set of all the elements observed so far. When compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system. We present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. We also consider the important case when the distribution of elements across different sites is non-uniform, and show that for such inputs, our algorithm significantly outperforms prior solutions.
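For context, the classical single-stream baseline that distributed sampling protocols generalize is Vitter's reservoir sampling (Algorithm R), sketched below; the message-optimal coordinator logic of the paper itself is more involved.

    import java.util.Random;

    // Classical reservoir sampling (Algorithm R) over one stream. Each arriving
    // element replaces a uniformly random reservoir slot with probability s/i,
    // which keeps every prefix uniformly sampled.
    public class ReservoirSample {
        public static int[] sample(int[] stream, int s, Random rng) {
            int[] reservoir = new int[s];
            for (int i = 0; i < stream.length; i++) {
                if (i < s) {
                    reservoir[i] = stream[i];          // fill the reservoir first
                } else {
                    int j = rng.nextInt(i + 1);        // uniform in [0, i]
                    if (j < s) reservoir[j] = stream[i];
                }
            }
            return reservoir;
        }
    }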
13. Multi-Grained Block Management to Enhance the Space Utilization of File Systems on PCM Storages
ABSTRACT
Phase-change memory (PCM) is a promising candidate as a storage medium to resolve the performance gap between main memory and storage in battery-powered mobile computing systems. However, it is more expensive than flash memory, and thus introduces a more serious storage capacity issue for low-cost solutions. This issue is further exacerbated by the fact that existing file systems are usually designed to trade space utilization for performance over block-oriented storage devices. In this work, we propose a multi-grained block management strategy to improve the space utilization of file systems over PCM-based storage systems. By utilizing the byte-addressability and fast read/write feature of PCM, a methodology is proposed to dynamically allocate multiple sizes of blocks to fit the size of each file, so as to resolve the space fragmentation issue with minimized space and management overheads. The space utilization of file systems is analyzed with consideration of block sizes. A series of experiments was conducted to evaluate the efficacy of the proposed strategy, and the results show that the proposed strategy can significantly improve the space utilization of file systems.
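The space-utilization arithmetic behind multi-grained allocation can be illustrated as follows (our sketch with made-up grain sizes): covering a file with the largest blocks first bounds internal fragmentation by the smallest granularity rather than by one large fixed block size.

    // Illustration of multi-grained allocation: choose block counts per grain so
    // the file is covered with minimal wasted space.
    public class MultiGrainedAlloc {
        static final int[] GRAINS = { 4096, 1024, 256 }; // hypothetical block sizes, descending

        // Returns how many blocks of each grain a file of the given size receives.
        static int[] allocate(int fileSize) {
            int[] counts = new int[GRAINS.length];
            int remaining = fileSize;
            for (int i = 0; i < GRAINS.length; i++) {
                if (i < GRAINS.length - 1) {
                    counts[i] = remaining / GRAINS[i];                   // full blocks that fit
                } else {
                    counts[i] = (remaining + GRAINS[i] - 1) / GRAINS[i]; // round up the tail
                }
                remaining -= counts[i] * GRAINS[i];
            }
            return counts;
        }

        public static void main(String[] args) {
            int[] c = allocate(5300); // 1x4096 + 1x1024 + 1x256 covers 5300 bytes, wasting 76
            System.out.println(c[0] + " " + c[1] + " " + c[2]);
        }
    }

With a single 4096-byte grain, the same file would waste 2,892 bytes; the multi-grained layout wastes at most one smallest grain per file.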
14. Resource-Saving File Management Scheme for Online Video Provisioning on Content Delivery Networks
ABSTRACT
Content delivery networks (CDNs) have been widely implemented to provide scalable cloud services. Such networks support resource pooling by allowing virtual machines or physical servers to be dynamically activated and deactivated according to current user demand. This paper examines online video replication and placement problems in CDNs. An effective video provisioning scheme must simultaneously (i) utilize system resources to reduce total energy consumption and (ii) limit replication overhead. We propose a scheme called adaptive data placement (ADP) that can dynamically place and reorganize video replicas among cache servers on subscribers' arrival and departure. Both the analyses and simulation results show that ADP can reduce the number of activated cache servers with limited replication overhead. In addition, ADP's performance is close to that of the optimal solution.
15. Inference Attack on Browsing History of Twitter Users Using Public Click Analytics and Twitter Metadata
ABSTRACT
Twitter is a popular online social network service for sharing short messages (tweets) among friends. Its users frequently use URL shortening services that provide (i) a short alias of a long URL for sharing it via tweets and (ii) public click analytics of shortened URLs. The public click analytics is provided in an aggregated form to preserve the privacy of individual users. In this paper, we propose practical attack techniques that infer who clicks which shortened URLs on Twitter using a combination of public information: Twitter metadata and public click analytics. Unlike conventional browser-history-stealing attacks, our attacks only demand publicly available information provided by Twitter and URL shortening services. Evaluation results show that our attack can compromise Twitter users' privacy with high accuracy.
16. Clustering Data Streams Based on Shared Density between Micro-Clusters
ABSTRACT
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to re-cluster the micro-clusters into larger final clusters. For re-clustering, the centers of the micro-clusters are used as pseudo points with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and re-clustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DB_STREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for re-clustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
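A rough Java sketch of the shared-density bookkeeping (ours, simplified from the paper's definitions): count points that fall into the overlap of two micro-clusters, and at re-clustering time merge pairs whose shared count is high relative to the clusters' own weights.

    import java.util.HashMap;
    import java.util.Map;

    // Simplified shared density graph: edges between micro-clusters carry counts
    // of points observed in the overlap of the two clusters' areas.
    public class SharedDensityGraph {
        private final Map<Long, Integer> shared = new HashMap<>(); // pair key -> count

        private static long key(int a, int b) {                    // order-independent pair key
            return a < b ? ((long) a << 32) | b : ((long) b << 32) | a;
        }

        // Online step: a point fell inside the radius of both micro-clusters.
        void pointSharedBy(int mcA, int mcB) {
            shared.merge(key(mcA, mcB), 1, Integer::sum);
        }

        // Offline step: connect the two micro-clusters if shared density exceeds
        // a fraction alpha of the smaller cluster's weight.
        boolean shouldMerge(int mcA, int mcB, int weightA, int weightB, double alpha) {
            int s = shared.getOrDefault(key(mcA, mcB), 0);
            return s >= alpha * Math.min(weightA, weightB);
        }
    }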
17. A Modified Hierarchical Attribute-based Encryption Access Control Method for Mobile Cloud Computing
ABSTRACT
Cloud computing is an Internet-based computing pattern through which shared resources are provided to devices on demand. It is an emerging but promising paradigm for integrating mobile devices into cloud computing, and the integration takes place in a cloud-based hierarchical multi-user data-shared environment. With this integration, security issues such as data confidentiality and user authority may arise in the mobile cloud computing system, and they are regarded as the main constraints on the development of mobile cloud computing. In order to provide safe and secure operation, a hierarchical access control method using modified hierarchical attribute-based encryption (M-HABE) and a modified three-layer structure is proposed in this paper. In a specific mobile cloud computing model, enormous data from all kinds of mobile devices, such as smartphones, feature phones and PDAs, can be controlled and monitored by the system; the data can be sensitive to unauthorized third parties and restricted to legal users as well. The novel scheme mainly focuses on data processing, storing and accessing, and is designed to ensure that users with legal authority obtain the corresponding classified data and to restrict illegal users and unauthorized legal users from accessing the data, which makes it extremely suitable for mobile cloud computing paradigms.
18.Using Crowdsourcing to Provide QoS for Mobile Cloud Computing
ABSTRACT
Quality of cloud service (QoS) is one of the crucial factors for the success of cloud providers in mobile cloud computing. Context-awareness is a popular method for automatic awareness of the mobile environment and for choosing the most suitable cloud provider. Lack of context information may harm the users' confidence in the application, rendering it useless. Thus, mobile devices need to be constantly aware of the environment and to test the performance of each cloud provider, which is inefficient and wastes energy. Crowdsourcing is a promising technology for discovering and selecting cloud services in order to provide intelligent, efficient, and stable services to mobile users based on group choice. This article introduces a crowdsourcing-based, QoS-supported mobile cloud service framework that achieves mobile users' satisfaction by sensing their context information and providing appropriate services to each of the users. Based on the user's activity context, social context, service context, and device context, our framework dynamically adapts cloud services to requests in different kinds of scenarios. The context-awareness-based management approach efficiently achieves a reliable cloud service platform that supplies quality of service on mobile devices.
19.Towards Achieving Data Security with the Cloud Computing Adoption Framework
ABSTRACT
Offering real-time data security for petabytes of data is important for cloud computing. A recent survey on cloud security states that the security of users' data has the highest priority as well as concern. We believe this can only be achieved with an approach that is systematic, adoptable and well-structured. Therefore, this paper develops a framework known as the Cloud Computing Adoption Framework (CCAF), which has been customized for securing cloud data. This paper explains the overview, rationale and components of CCAF for protecting data security. CCAF is illustrated by a system design based on the requirements, and the implementation is demonstrated by the CCAF multi-layered security. Since our data center has 10 petabytes of data, providing real-time protection and quarantine is a huge task. We use Business Process Modeling Notation (BPMN) to simulate how data is in use; BPMN simulation allows us to evaluate the chosen security performance before actual implementation. Results show that taking control of a security breach can take between 50 and 125 hours, which means additional security is required to ensure all data is well protected in those crucial 125 hours. This paper also demonstrates that CCAF multi-layered security can protect data in real time with three layers of security: 1) firewall and access control; 2) identity management and intrusion prevention; and 3) convergent encryption. To validate CCAF, this paper undertakes two sets of ethical-hacking experiments involving penetration testing with 10,000 trojans and viruses. The CCAF multi-layered security can block 9,919 viruses and trojans, which can be destroyed in seconds, and the remaining ones can be quarantined or isolated. The experiments show that although the percentage blocked can decrease under continuous injection of viruses and trojans, 97.43 percent of them can be quarantined. Our CCAF multi-layered security has on average 20 percent better performance than the single-layered approach, which could only block 7,438 viruses and trojans. CCAF can be more effective when combined with BPMN simulation to evaluate security processes and penetration testing results.
20.A Combinatorial Auction mechanism for multiple resource procurement in cloud computing
ABSTRACT
Multiple resource procurement from several cloud vendors participating in bidding is addressed in this paper. This is done by assigning dynamic pricing for these resources. Since we consider multiple resources to be procured from several cloud vendors bidding in an auction, the problem turns out to be a combinatorial auction. We pre-process the user requests, analyze the auction and declare a set of vendors bidding for the auction as winners based on the Combinatorial Auction Branch on Bids (CABOB) model. Simulations using our approach with prices procured from several cloud vendors' datasets show its effectiveness for multiple resource procurement in the realm of cloud computing.
21.Online Resource Scheduling Under Concave Pricing for Cloud Computing
ABSTRACT
With the booming cloud computing industry, computational resources are readily and elastically available to customers. In order to attract customers with various demands, most Infrastructure-as-a-Service (IaaS) cloud service providers offer several pricing strategies such as pay-as-you-go, pay less per unit when you use more (so-called volume discount), and pay even less when you reserve. The diverse pricing schemes among different IaaS service providers, or even within the same provider, form a complex economic landscape that nurtures the market of cloud brokers. By strategically scheduling multiple customers' resource requests, a cloud broker can fully take advantage of the discounts offered by cloud service providers. In this paper, we focus on how a broker can help a group of customers to fully utilize the volume discount pricing strategy offered by cloud service providers through cost-efficient online resource scheduling. We present a randomized online stack-centric scheduling algorithm (ROSA) and theoretically prove the lower bound of its competitive ratio. Three special cases of the offline concave cost scheduling problem and the corresponding optimal algorithms are introduced. Our simulation shows that ROSA achieves a competitive ratio close to the theoretical lower bound under the special cases. Trace-driven simulation using Google cluster data demonstrates that ROSA is superior to conventional online scheduling algorithms in terms of cost saving.
22.A Survey of Proxy Re-Encryption for Secure Data Sharing in Cloud Computing
ABSTRACT
Never before has data sharing been more convenient than with the rapid development and wide adoption of cloud computing. However, ensuring the security of cloud users' data is becoming one of the main obstacles that hinder cloud computing from extensive adoption. Proxy re-encryption serves as a promising solution for securing data sharing in cloud computing. It enables a data owner to encrypt shared data in the cloud under its own public key, which is further transformed by a semi-trusted cloud server into an encryption intended for the legitimate recipient for access control. This paper gives a solid and inspiring survey of proxy re-encryption from different perspectives to offer a better understanding of this primitive. In particular, we review the state of the art of proxy re-encryption by investigating the design philosophy, examining the security models and comparing the efficiency and security proofs of existing schemes. Furthermore, the potential applications and extensions of proxy re-encryption are also discussed. Finally, the paper concludes with a summary of possible future work.
23.Attribute-based Access Control with Constant-size Ciphertext in Cloud Computing
ABSTRACT
With the popularity of cloud computing, there have been increasing concerns about its security and privacy. Since the cloud computing environment is distributed and untrusted, data owners have to encrypt outsourced data to enforce confidentiality. Therefore, how to achieve practicable access control of encrypted data in an untrusted environment is an urgent issue that needs to be solved. Attribute-Based Encryption (ABE) is a promising scheme suitable for access control in cloud storage systems. This paper proposes a hierarchical attribute-based access control scheme with constant-size ciphertext. The scheme is efficient because the length of the ciphertext and the number of bilinear pairing evaluations are fixed to a constant, and its computation cost in the encryption and decryption algorithms is low. Moreover, the hierarchical authorization structure of our scheme reduces the burden and risk of a single-authority scenario. We prove that the scheme is CCA2-secure under the decisional q-Bilinear Diffie-Hellman Exponent assumption. In addition, we implement our scheme and analyse its performance. The analysis results show the proposed scheme is efficient, scalable, and fine-grained in dealing with access control for outsourced data in cloud computing.
24.A Context-Aware Architecture Supporting Service Availability in Mobile Cloud Computing
ABSTRACT
Mobile systems are gaining more and more importance, and new promising paradigms like Mobile Cloud Computing are emerging. Mobile Cloud Computing provides an infrastructure where data storage and processing can happen outside the mobile node. Specifically, there is major interest in the use of services obtained by taking advantage of the distributed resource pooling provided by nearby mobile nodes in a transparent way. This kind of system is useful in application domains such as emergencies, education and tourism. However, these systems are commonly based on dynamic network topologies, in which disconnections and network partitions can occur frequently, and thus the availability of the services is usually compromised. Techniques and methods from Autonomic Computing can be applied to Mobile Cloud Computing to build dependable service models that take into account changes in the context. In this work, a context-aware software architecture is proposed to support the availability of services deployed in mobile and dynamic network environments. The proposal is based on a service replication scheme together with a self-configuration approach for the activation/hibernation of replicas of the service, depending on relevant context information from the mobile system. To that end, an election algorithm has been designed and implemented.
25.Flexible and Fine-Grained Attribute-Based Data Storage in Cloud Computing
ABSTRACT
With the development of cloud computing, outsourcing data to cloud servers has attracted a lot of attention. To guarantee security and achieve flexible, fine-grained file access control, attribute-based encryption (ABE) was proposed and used in cloud storage systems. However, user revocation is the primary issue in ABE schemes. In this article, we provide a ciphertext-policy attribute-based encryption (CP-ABE) scheme with efficient user revocation for cloud storage systems. The issue of user revocation can be solved efficiently by introducing the concept of a user group: when any user leaves, the group manager updates the private keys of all users except those who have been revoked. Additionally, CP-ABE schemes have heavy computation cost, as it grows linearly with the complexity of the access structure. To reduce the computation cost, we outsource the high computation load to cloud service providers without leaking file content or secret keys. Notably, our scheme can withstand collusion attacks performed by revoked users cooperating with existing users. We prove the security of our scheme under the divisible computation Diffie-Hellman (DCDH) assumption. The results of our experiments show that the computation cost for local devices is relatively low and can be constant. Our scheme is suitable for resource-constrained devices.
26.Fair Resource Allocation for Data-Intensive Computing in the Cloud
ABSTRACT
To address the computing challenges of 'big data', a number of data-intensive computing frameworks (e.g., MapReduce, Dryad, Storm and Spark) have emerged and become popular. YARN is a de facto resource management platform that enables these frameworks to run together in a shared system. However, we observe that, in a cloud computing environment, the fair resource allocation policy implemented in YARN is not suitable, because its memoryless resource allocation fashion leads to violations of a number of good properties in shared computing systems. This paper attempts to address these problems for YARN. Both single-level and hierarchical resource allocations are considered. For single-level resource allocation, we propose a novel fair resource allocation mechanism called Long-Term Resource Fairness (LTRF). For hierarchical resource allocation, we propose Hierarchical Long-Term Resource Fairness (H-LTRF) by extending LTRF. We show that both LTRF and H-LTRF can address the fairness problems of the current resource allocation policy and are thus suitable for cloud computing. Finally, we have developed LTYARN by implementing LTRF and H-LTRF in YARN, and our experiments show that it leads to better resource fairness than the existing fair schedulers of YARN.
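A toy sketch of the long-term idea (ours; LTRF itself is defined over resource shares and YARN containers): grant each freed resource to the user with the smallest cumulative allocation so far, rather than the smallest current allocation, so users that yielded resources earlier are paid back later.

    import java.util.Comparator;
    import java.util.PriorityQueue;

    // Toy long-term fair allocator: instead of handing each freed container to
    // the currently least-loaded user (memoryless max-min), hand it to the user
    // with the smallest *cumulative* allocation so far.
    public class LongTermFairAllocator {
        static class User {
            final String name;
            long cumulative;                      // total resources received so far
            User(String name) { this.name = name; }
        }

        private final PriorityQueue<User> queue =
            new PriorityQueue<>(Comparator.comparingLong(u -> u.cumulative));

        LongTermFairAllocator(User... users) { for (User u : users) queue.add(u); }

        // Grant the next free container to the user with the least cumulative share.
        User grant(long amount) {
            User u = queue.poll();
            u.cumulative += amount;
            queue.add(u);                          // re-insert with updated key
            return u;
        }
    }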
27.Secure Data Sharing in Cloud Computing Using Revocable-Storage Identity-Based Encryption
28.Knowledge-Based Resource Allocation for Collaborative Simulation Development in a Multi-tenant Cloud Computing Environment
29.KSF-OABE: Outsourced Attribute-Based Encryption with Keyword Search Function for Cloud Storage
ABSTRACT
Cloud computing is becoming increasingly popular, allowing data owners to outsource their data to public cloud servers while letting intended data users retrieve the data stored in the cloud. This computing model brings challenges to the security and privacy of data stored in the cloud. Attribute-based encryption (ABE) technology has been used to design fine-grained access control systems, which provide one good method to solve the security issues in the cloud setting. However, the computation cost and ciphertext size in most ABE schemes grow with the complexity of the access policy. Outsourced ABE (OABE) with fine-grained access control can largely reduce the computation cost for users who want to access encrypted data stored in the cloud, by outsourcing the heavy computation to the cloud service provider (CSP). However, as the amount of encrypted files stored in the cloud becomes very large, efficient query processing is hindered. To deal with the above problem, we present a new cryptographic primitive called attribute-based encryption with outsourced key-issuing and outsourced decryption, which can implement a keyword search function (KSF-OABE). The proposed KSF-OABE scheme is proved secure against chosen-plaintext attack (CPA). The CSP performs the partial decryption task delegated by the data user without knowing anything about the plaintext. Moreover, the CSP can perform encrypted keyword search without knowing anything about the keywords embedded in the trapdoor.
30.A Trust Label System for Communicating Trust in Cloud Services
ABSTRACT
Cloud computing is rapidly changing the digital service landscape. A proliferation of Cloud providers has emerged, increasing the difficulty of consumer decisions. Trust issues have been identified as a factor holding back Cloud adoption. The risks and challenges inherent in the adoption of Cloud services are well recognised in the computing literature. In conjunction with these risks, the relative novelty of the online environment as a context for the provision of business services can increase consumer perceptions of uncertainty. This uncertainty is worsened in a Cloud context due to the lack of transparency, from the consumer perspective, into the service types, operational conditions and quality of service offered by diverse providers. Previous approaches have failed to provide an appropriate medium for communicating trust and trustworthiness in Clouds. A new strategy is required to improve consumer confidence and trust in Cloud providers. This paper presents the operationalisation of a trust label system designed to communicate trust and trustworthiness in Cloud services. We describe the technical details and implementation of the trust label components. Based on a use case scenario, an initial evaluation was carried out to test its operation and usefulness for increasing consumer trust in Cloud services.
31.Towards Trustworthy Multi-Cloud Services Communities: A Trust-based Hedonic Coalitional Game
ABSTRACT
The prominence of cloud computing has led to an unprecedented proliferation in the number of Web services deployed in cloud data centres. In parallel, service communities have recently gained increasing interest due to their ability to facilitate discovery, composition, and resource scaling in large-scale service markets. The problem is that traditional community formation models may work well when all services reside in a single cloud but cannot support a multi-cloud environment. In particular, these models overlook malicious services that misbehave to illegally maximize their benefits, a risk that arises from grouping together services owned by different providers. Besides, they rely on a centralized architecture whereby a central entity regulates the community formation, which contradicts the distributed nature of cloud-based services. In this paper, we propose a three-fold solution that includes: a trust establishment framework that is resilient to collusion attacks that occur to mislead trust results; a bootstrapping mechanism that capitalizes on the endorsement concept in online social networks to assign initial trust values; and a trust-based hedonic coalitional game that enables services to distributively form trustworthy multi-cloud communities. Experiments conducted on a real-life dataset demonstrate that our model minimizes the number of malicious services compared to three state-of-the-art cloud federation and service community models.
32.Cost Effective, Reliable and Secure Workflow Deployment over Federated Clouds
ABSTRACT
The significant growth in cloud computing has led to an increasing number of cloud providers, each offering their service under different conditions – one might be more secure whilst another might be less expensive or more reliable. At the same time, user applications have become more and more complex. Often, they consist of a diverse collection of software components and need to handle variable workloads, which poses different requirements on the infrastructure. Therefore, many organisations are considering using a combination of different clouds to satisfy these needs. This raises, however, a non-trivial issue of how to select the best combination of clouds to meet the application requirements. This paper presents a novel algorithm to deploy workflow applications on federated clouds. Firstly, we introduce an entropy-based method to quantify the most reliable workflow deployments. Secondly, we apply an extension of the Bell-LaPadula multi-level security model to address application security requirements. Finally, we optimise deployment in terms of its entropy and also its monetary cost, taking into account the cost of computing power, data storage and inter-cloud communication. We implemented our new approach and compared it against two existing scheduling algorithms: Extended Dynamic Constraint Algorithm (EDCA) and Extended Bi-objective Dynamic Level Scheduling (EBDLS). We show that our algorithm can find deployments that are of equivalent reliability but are less expensive and meet security requirements. We have validated our solution through a set of realistic scientific workflows, using well-known cloud simulation tools (WorkflowSim and DynamicCloudSim) and a realistic cloud-based data analysis system (e-Science Central).
33.Protecting Your Right: Verifiable Attribute-Based Keyword Search with Fine-Grained Owner-Enforced Search Authorization in the Cloud
ABSTRACT
Search over encrypted data is a critically important enabling technique in cloud computing, where encryption-before-outsourcing is a fundamental solution to protecting user data privacy in the untrusted cloud server environment. Many secure search schemes have focused on the single-contributor scenario, where the outsourced dataset or the secure searchable index of the dataset is encrypted and managed by a single owner, typically based on symmetric cryptography. In this paper, we focus on a different yet more challenging scenario where the outsourced dataset can be contributed by multiple owners and is searchable by multiple users, i.e., the multi-user multi-contributor case. Inspired by attribute-based encryption (ABE), we present the first attribute-based keyword search scheme with efficient user revocation (ABKS-UR) that enables scalable fine-grained (i.e., file-level) search authorization. Our scheme allows multiple owners to encrypt and outsource their data to the cloud server independently. Users can generate their own search capabilities without relying on an always-online trusted authority. Fine-grained search authorization is also implemented by the owner-enforced access policy on the index of each file. Further, by incorporating proxy re-encryption and lazy re-encryption techniques, we are able to delegate the heavy system update workload during user revocation to the resourceful semi-trusted cloud server. We formalize the security definition and prove the proposed ABKS-UR scheme selectively secure against chosen-keyword attack. To build the confidence of data users in the proposed secure search system, we also design a search result verification scheme. Finally, performance evaluation shows the efficiency of our scheme.
34.Cloud workflow scheduling with deadlines and time slot availability
ABSTRACT
Allocating service capacities in cloud computing is based on the assumption that they are unlimited and can be used at any time. However, available service capacities change with workload and, from the cloud provider's perspective, cannot satisfy users' requests at any time, because cloud services can be shared by multiple tasks. Cloud service providers provide available time slots for new users' requests based on available capacities. In this paper, we consider workflow scheduling with deadlines and time slot availability in cloud computing. An iterated heuristic framework is presented for the problem under study, which mainly consists of initial solution construction, improvement, and perturbation. Three initial solution construction strategies, two greedy- and fair-based improvement strategies and a perturbation strategy are proposed. Different strategies in the three phases result in several heuristics. Experimental results show that different initial solution and improvement strategies have different effects on solution quality.
35.Circuit Ciphertext-Policy Attribute-Based Hybrid Encryption with Verifiable Delegation in Cloud Computing
ABSTRACT
In the cloud, to achieve access control and keep data confidential, data owners can adopt attribute-based encryption to encrypt the stored data. Users with limited computing power are, however, more likely to delegate the mask of the decryption task to the cloud servers to reduce the computing cost. As a result, attribute-based encryption with delegation emerges. Still, there are caveats and questions remaining in previous relevant works. For instance, during the delegation, the cloud servers could tamper with or replace the delegated ciphertext and respond with a forged computing result with malicious intent. They may also cheat eligible users by responding that they are ineligible, for the purpose of cost saving. Furthermore, during the encryption, the access policies may not be flexible enough. Since a policy for general circuits enables the strongest form of access control, a construction realizing circuit ciphertext-policy attribute-based hybrid encryption with verifiable delegation is considered in our work. In such a system, combined with verifiable computation and an encrypt-then-MAC mechanism, the data confidentiality, fine-grained access control and correctness of the delegated computing results are well guaranteed at the same time. Besides, our scheme achieves security against chosen-plaintext attacks under the k-multilinear Decisional Diffie-Hellman assumption. Moreover, an extensive simulation campaign confirms the feasibility and efficiency of the proposed solution.
36.Joint Energy Minimization and Resource Allocation in C-RAN with Mobile Cloud
ABSTRACT
Cloud radio access network (C-RAN) has emerged as a potential candidate for the next-generation access network technology to address increasing mobile traffic, while mobile cloud computing (MCC) offers a prospective solution for resource-limited mobile users executing computation-intensive tasks. Taking full advantage of these two cloud-based techniques, C-RAN with MCC is presented in this paper to enhance both performance and energy efficiency. In particular, this paper studies joint energy minimization and resource allocation in C-RAN with MCC under the time constraints of the given tasks. We first review the energy and time models of computation and communication. Then, we formulate the joint energy minimization as a non-convex optimization with constraints on task execution time, transmit power, computation capacity and fronthaul data rates. This non-convex optimization is then reformulated into an equivalent convex problem based on weighted minimum mean square error (WMMSE). An iterative algorithm is finally given to deal with the joint resource allocation in C-RAN with mobile cloud. Simulation results confirm that the proposed energy minimization and resource allocation solution can improve the system performance and save energy.
37.A Secure and Dynamic Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data
ABSTRACT
Due to the increasing popularity of cloud computing, more and more data owners are motivated to outsource their data to cloud servers for great convenience and reduced cost in data management. However, sensitive data should be encrypted before outsourcing for privacy requirements, which obsoletes data utilization like keyword-based document retrieval. In this paper, we present a secure multi-keyword ranked search scheme over encrypted cloud data, which simultaneously supports dynamic update operations like deletion and insertion of documents. Specifically, the vector space model and the widely used TF x IDF model are combined in index construction and query generation. We construct a special tree-based index structure and propose a "Greedy Depth-First Search" algorithm to provide efficient multi-keyword ranked search. The secure kNN algorithm is utilized to encrypt the index and query vectors, and meanwhile ensure accurate relevance score calculation between encrypted index and query vectors. In order to resist statistical attacks, phantom terms are added to the index vector to blind search results. Due to the use of our special tree-based index structure, the proposed scheme can achieve sub-linear search time and deal with the deletion and insertion of documents flexibly. Extensive experiments are conducted to demonstrate the efficiency of the proposed scheme.
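Since the encrypted index reproduces TF x IDF relevance as an inner product of secure-kNN-encrypted vectors, a plaintext sketch of the underlying score helps fix ideas (one common TF x IDF variant; the paper may weight terms slightly differently).

    import java.util.List;
    import java.util.Map;

    // Plaintext TF x IDF relevance of one document against a keyword query.
    public class TfIdfScore {
        // tf: term frequency of each keyword in this document
        // df: number of documents containing each keyword; n: collection size
        static double score(Map<String, Integer> tf, Map<String, Integer> df,
                            int n, List<String> query) {
            double s = 0;
            for (String w : query) {
                int f = tf.getOrDefault(w, 0);
                int d = df.getOrDefault(w, 0);
                if (f > 0 && d > 0) {
                    // log-scaled term frequency times inverse document frequency
                    s += (1 + Math.log(f)) * Math.log(1.0 + (double) n / d);
                }
            }
            return s;
        }
    }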
38.Probabilistic Optimization of Resource Distribution and Encryption for Data Storage in the Cloud
ABSTRACT
In this paper, we develop a decentralized probabilistic method for performance optimization of cloud services. We focus on Infrastructure-as-a-Service, where the user is provided with the ability to configure virtual resources on demand in order to satisfy specific computational requirements. This novel approach is strongly supported by a theoretical framework based on tail probabilities and sample complexity analysis. It allows not only the inclusion of performance metrics for the cloud but also the incorporation of security metrics based on cryptographic algorithms for data storage. To the best of the authors' knowledge, this is the first unified approach to provisioning performance and security on demand subject to the Service Level Agreement between the client and the cloud service provider. The quality of the service is guaranteed given certain values of accuracy and confidence. We present some experimental results using the Amazon Web Services Elastic Compute Cloud (EC2) service to validate our probabilistic optimization method.
39.Collective Energy-Efficiency Approach to Data Center Networks Planning
ABSTRACT
The energy efficiency of data centers (DCs) has become a major concern as DCs continue to grow large, often hosting tens of thousands of servers or even hundreds of thousands. Clearly, such a volume of DCs implies data center networks (DCNs) with a huge number of network nodes and links. The energy consumption of this communication network has skyrocketed and is now in the same league as the computing servers' costs. With the ever-increasing amount of data that needs to be stored and processed in DCs, DCN traffic continues to soar, drawing increasingly more power. In particular, more than one-third of the total energy in DCs is consumed by communication links, switching and aggregation elements. In this paper, we address the energy efficiency of data centers, explicitly taking into account both servers and the DCN. To this end, we present VPTCA, a collective energy-efficiency approach to data center network planning that deals with virtual machine (VM) placement and communication traffic configuration. VPTCA aims particularly to reduce the energy consumption of the DCN by assigning interrelated VMs to the same server or pod, which effectively helps reduce the amount of transmission load. At the traffic level, VPTCA optimally uses switch ports and link bandwidth to balance load and avoid congestion, enabling the DCN to increase its transmission capacity and saving a significant amount of network energy. In our evaluation via NS-2 simulations, the performance of VPTCA is measured and compared with two well-known DCN management algorithms, Global First Fit and Elastic Tree. Based on our experimental results, VPTCA outperforms existing algorithms in providing the DCN more transmission capacity with less energy consumption.
40.Middleware-oriented Deployment Automation for Cloud Applications
ABSTRACT
Fully automated provisioning and deployment of applications is one of the most essential prerequisites to benefit from Cloud computing and reduce the costs of managing applications. A huge variety of approaches, tools, and providers are available to automate the involved processes. The DevOps community, for instance, provides tooling and reusable artifacts to implement deployment automation in an application-oriented manner. Platform-as-a-Service frameworks are available for the same purpose. In this work we systematically classify and characterize available deployment approaches independently of the underlying technology used. For motivation and evaluation purposes, we choose Web applications with different technology stacks and analyze their specific deployment requirements. Afterwards, we provision these applications using each of the identified types of deployment approaches in the Cloud to perform qualitative and quantitative measurements. Finally, we discuss the evaluation results and derive recommendations on which deployment approach to use based on the deployment requirements of an application. Our results show that deployment approaches can also be efficiently combined if there is no 'best fit' for a particular application.
41.Trust-but-Verify: Verifying Result Correctness of Outsourced Frequent Itemset Mining in Data-Mining-As-a-Service Paradigm
ABSTRACT
Cloud computing is popularizing the computing paradigm in which data is outsourced to a third-party service provider (server) for data mining. Outsourcing, however, raises a serious security issue: how can a client with weak computational power verify that the server returned correct mining results? In this paper, we focus on the specific task of frequent itemset mining. We consider a server that is potentially untrusted and tries to escape verification by using its prior knowledge of the outsourced data. We propose efficient probabilistic and deterministic verification approaches to check whether the server has returned correct and complete frequent itemsets. Our probabilistic approach can catch incorrect results with high probability, while our deterministic approach measures the result correctness with 100 percent certainty. We also design efficient verification methods for the cases in which the data and the mining setup are updated. We demonstrate the effectiveness and efficiency of our methods using an extensive set of empirical results on real datasets.
42.Providing User Security Guarantees in Public Infrastructure Clouds
ABSTRACT
The infrastructure cloud (IaaS) service model offers improved resource flexibility and availability, where tenants – insulated from the minutiae of hardware maintenance – rent computing resources to deploy and operate complex systems. Large-scale services running on IaaS platforms demonstrate the viability of this model; nevertheless, many organizations operating on sensitive data avoid migrating operations to IaaS platforms due to security concerns. In this paper, we describe a framework for data and operation security in IaaS, consisting of protocols for a trusted launch of virtual machines and domain-based storage protection. We continue with an extensive theoretical analysis with proofs about protocol resistance against attacks in the defined threat model. The protocols allow trust to be established by remotely attesting host platform configuration prior to launching guest virtual machines and ensure confidentiality of data in remote storage, with encryption keys maintained outside of the IaaS domain. Presented experimental results demonstrate the validity and efficiency of the proposed protocols. The framework prototype was implemented on a test bed operating a public electronic health record system, showing that the proposed protocols can be integrated into existing cloud environments.
43.Energy-efficient Adaptive Resource Management for Real-time Vehicular Cloud Services
ABSTRACT
Providing real-time cloud services to Vehicular Clients (VCs) must cope with delay and delay-jitter issues. Fog computing is an emerging paradigm that aims at distributing small-size self-powered data centers (e.g., Fog nodes) between remote Clouds and VCs, in order to deliver data-dissemination real-time services to the connected VCs. Motivated by these considerations, in this paper, we propose and test an energy-efficient adaptive resource scheduler for Networked Fog Centers (NetFCs). They operate at the edge of the vehicular network and are connected to the served VCs through Infrastructure-to-Vehicular (I2V) TCP/IP-based single-hop mobile links. The goal is to exploit the locally measured states of the TCP/IP connections, in order to maximize the overall communication-plus-computing energy efficiency, while meeting the application-induced hard QoS requirements on the minimum transmission rates, maximum delays and delay-jitters. The resulting energy-efficient scheduler jointly performs: (i) admission control of the input traffic to be processed by the NetFCs; (ii) minimum-energy dispatching of the admitted traffic; (iii) adaptive reconfiguration and consolidation of the Virtual Machines (VMs) hosted by the NetFCs; and (iv) adaptive control of the traffic injected into the TCP/IP mobile connections. The salient features of the proposed scheduler are that: (i) it is adaptive and admits distributed and scalable implementation; and (ii) it is capable of providing hard QoS guarantees, in terms of minimum/maximum instantaneous rates of the traffic delivered to the vehicular clients, instantaneous rate-jitters and total processing delays. The actual performance of the proposed scheduler in the presence of: (i) client mobility; (ii) wireless fading; and (iii) reconfiguration and consolidation costs of the underlying NetFCs, is numerically tested and compared against that of some state-of-the-art schedulers, under both synthetically generated and measured real-world workload traces.
44.Cloud Service Reliability Enhancement via Virtual Machine Placement Optimization
ABSTRACT
With rapid adoption of the cloud computing model, many enterprises have begun deploying cloud-based services. Failures of virtual machines (VMs) in clouds have caused serious quality assurance issues for those services. VM replication is a commonly used technique for enhancing the reliability of cloud services. However, when determining the VM redundancy strategy for a specific service, many state-of-the-art methods ignore the huge network resource consumption issue that could be experienced when the service is in failure recovery mode. This paper proposes a redundant VM placement optimization approach to enhancing the reliability of cloud services. The approach employs three algorithms. The first algorithm selects an appropriate set of VM-hosting servers from a potentially large set of candidate host servers based upon the network topology. The second algorithm determines an optimal strategy to place the primary and backup VMs on the selected host servers with k-fault-tolerance assurance. Lastly, a heuristic is used to address the task-to-VM reassignment optimization problem, which is formulated as finding a maximum weight matching in bipartite graphs. The evaluation results show that the proposed approach outperforms four other representative methods in network resource consumption in the service recovery stage.
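To make the matching step above concrete, the sketch below shows one simple greedy heuristic for weighted bipartite matching between tasks and recovered VMs: pick the heaviest non-conflicting task-VM edges first. This is an illustrative stand-in (a known 1/2-approximation), not the paper's own heuristic; the Edge class and its weights are hypothetical.

    import java.util.*;

    // Illustrative greedy heuristic for maximum-weight bipartite matching
    // between tasks and VMs; not the paper's algorithm.
    public class GreedyTaskVmMatching {
        public static class Edge {
            final int task, vm; final double weight;
            public Edge(int task, int vm, double weight) {
                this.task = task; this.vm = vm; this.weight = weight;
            }
        }

        // Picks non-conflicting task-VM pairs in descending weight order.
        public static Map<Integer, Integer> match(List<Edge> edges) {
            edges.sort((a, b) -> Double.compare(b.weight, a.weight));
            Set<Integer> usedTasks = new HashSet<>(), usedVms = new HashSet<>();
            Map<Integer, Integer> assignment = new HashMap<>();
            for (Edge e : edges) {
                if (!usedTasks.contains(e.task) && !usedVms.contains(e.vm)) {
                    assignment.put(e.task, e.vm);
                    usedTasks.add(e.task);
                    usedVms.add(e.vm);
                }
            }
            return assignment;
        }
    }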
45.A Novel Statistical Cost Model and an Algorithm for Efficient Application Offloading to Clouds
ABSTRACT
This work presents a novel statistical cost model for applications that can be offloaded to cloud computing environments. The model constructs a tree structure, referred to as the execution dependency tree (EDT), to accurately represent various execution relations, or dependencies (e.g., sequential, parallel and conditional branching), among the application modules along its different execution paths. Contrary to existing models that assume fixed average offloading costs, each module's cost is modelled as a random variable described by its Cumulative Distribution Function (CDF) that is statistically estimated through application profiling. Using this model, we generalize the offloading cost optimization functions to those that use more user-tailored statistical measures such as cost percentiles. We employ these functions to propose an efficient offloading algorithm based on a dynamic programming formulation. We also show that the proposed model can be used as an efficient tool for application analysis by developers to gain insights into the applications' statistical performance under varying network conditions and user behaviours. Performance evaluation results show that the achieved mean absolute percentage error between the model-based estimated cost and the measured one for the application execution time can be as small as 5% for applications with sequential and branching module dependencies.
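As a tiny illustration of the percentile-based cost measures mentioned above, the sketch below estimates a cost percentile from profiled samples of a module, approximating its empirical CDF with the nearest-rank method. The class and the sample values are hypothetical; the paper's EDT-based model is not reproduced here.

    import java.util.Arrays;

    // Hypothetical helper: estimates a cost percentile from profiled samples,
    // i.e., an inverse of the module's empirical CDF (nearest-rank method).
    public class CostPercentile {
        // samples: measured costs of one module (e.g., execution times in ms)
        // p: desired percentile in (0, 100], e.g., 95
        public static double percentile(double[] samples, double p) {
            double[] sorted = samples.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }

        public static void main(String[] args) {
            double[] profiled = {12.1, 15.3, 11.8, 40.2, 13.5, 14.0, 12.9, 55.7};
            System.out.printf("95th-percentile cost: %.1f ms%n", percentile(profiled, 95));
        }
    }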
46.PacketCloud: A Cloudlet-Based Open Platform for In-Network Services
ABSTRACT
The Internet was designed with the end-to-end principle, where the network layer provides merely a best-effort forwarding service. This design makes it challenging to add new services into the Internet infrastructure. However, as Internet connectivity becomes a commodity, users and applications increasingly demand new in-network services. This paper proposes PacketCloud, a cloudlet-based open platform to host in-network services. Different from standalone, specialized middleboxes, cloudlets can efficiently share a set of commodity servers among different services, and serve the network traffic in an elastic way. PacketCloud can help both Internet Service Providers (ISPs) and emerging application/content providers deploy their services at strategic network locations. We have implemented a proof-of-concept prototype of PacketCloud. PacketCloud introduces a small additional delay, and can scale well to handle high-throughput data traffic. We have evaluated PacketCloud in both a fully functional emulated environment and the real Internet.
47.A Dynamical and Load-Balanced Flow Scheduling Approach for Big Data Centers in Clouds
ABSTRACT
Load-balanced flow scheduling for big data centers in clouds, in which a large amount of data needs to be transferred frequently among thousands of interconnected servers, is a key and challenging issue. OpenFlow is a promising solution for balancing data flows in a data center network through its programmatic traffic controller. Existing OpenFlow-based scheduling schemes, however, statically set up routes only at the initialization stage of data transmissions, which suffers from dynamic flow distribution and changing network states in data centers and often results in poor system performance. In this paper, we propose a novel dynamical load-balanced scheduling (DLBS) approach for maximizing the network throughput while balancing workload dynamically. We first formulate the DLBS problem, and then develop a set of efficient heuristic scheduling algorithms for two typical OpenFlow network models, which balance data flows time slot by time slot. Experimental results demonstrate that our DLBS approach significantly outperforms other representative load-balanced scheduling algorithms, Round Robin and LOBUS; and the higher the imbalance degree the data flows in data centers exhibit, the more improvement our DLBS approach brings to the data centers.
48.Feedback Autonomic Provisioning for Guaranteeing Performance in MapReduce Systems
ABSTRACT
Companies have fast-growing amounts of data to process and store; a data explosion is happening around us. Currently, some of the most common approaches to processing these vast data quantities are based on the MapReduce parallel programming paradigm. While its use is widespread in industry, ensuring performance constraints while minimizing costs still poses considerable challenges. We propose a coarse-grained control-theoretical approach based on techniques that have already proved their usefulness in the control community. We introduce the first algorithm to create dynamic models for Big Data MapReduce systems running a concurrent workload. Furthermore, we identify two important control use cases: relaxed performance with minimal resources, and strict performance. For the first case we develop two feedback control mechanisms: a classical feedback controller and an event-based feedback controller that also minimizes the number of cluster reconfigurations. Moreover, to address strict performance requirements, a feedforward predictive controller that efficiently suppresses the effects of large workload-size variations is developed. All controllers are validated online in a benchmark running on a real 60-node MapReduce cluster with a data-intensive Business Intelligence workload. Our experiments demonstrate the success of the employed control strategies in assuring service-time constraints.
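For flavor, a classical feedback loop for cluster sizing can be as simple as the proportional-integral sketch below. The gains and the control law are illustrative assumptions; the paper identifies its models and controllers from the running system rather than using fixed gains.

    // Minimal proportional-integral (PI) controller sketch for cluster sizing.
    // kp and ki are illustrative gains, not values from the paper.
    public class ClusterPIController {
        private final double kp, ki;
        private double integral = 0.0;

        public ClusterPIController(double kp, double ki) {
            this.kp = kp; this.ki = ki;
        }

        // measuredServiceTime and deadline in seconds; returns new node count.
        public int nextClusterSize(int currentNodes, double measuredServiceTime, double deadline) {
            double error = measuredServiceTime - deadline; // positive = too slow
            integral += error;
            double adjustment = kp * error + ki * integral;
            return Math.max(1, currentNodes + (int) Math.round(adjustment)); // keep at least one node
        }
    }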
49.Effective Modelling Approach for IaaS Data Center Performance Analysis under Heterogeneous Workload
ABSTRACT
Heterogeneity prevails not only among physical machines but also among workloads in real IaaS Cloud data centers (CDCs). This heterogeneity makes performance modelling of large and complex IaaS CDCs even more challenging. This paper considers the scenario where the number of virtual CPUs requested by each customer job may be different. We propose a hierarchical stochastic modelling approach applicable to IaaS CDC performance analysis under such a heterogeneous workload. Numerical results obtained from the proposed analytic model are verified through discrete-event simulations under various system parameter settings.
50.An Energy-Efficient VM Prediction and Migration Framework for Overcommitted Clouds
ABSTRACT
We propose an integrated, energy-efficient resource allocation framework for overcommitted clouds. The framework achieves substantial energy savings by 1) minimizing Physical Machine (PM) overload occurrences via VM resource usage monitoring and prediction, and 2) reducing the number of active PMs via efficient VM migration and placement. Using real Google data consisting of 29-day traces collected from a cluster containing more than 12K PMs, we show that our proposed framework outperforms existing overload avoidance techniques and prior VM migration strategies by reducing the number of unpredicted overloads, minimizing migration overhead, increasing resource utilization, and reducing cloud energy consumption.
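A minimal sketch of the prediction side is shown below, assuming a simple exponentially weighted moving average (EWMA) over observed utilization. The framework's actual predictor may be more sophisticated; this only illustrates the monitoring-and-prediction idea.

    // EWMA predictor sketch for VM resource usage; a stand-in for the
    // framework's prediction component, not its actual algorithm.
    public class EwmaUsagePredictor {
        private final double alpha;   // smoothing factor in (0, 1]
        private double estimate = -1; // last smoothed utilization, -1 = unset

        public EwmaUsagePredictor(double alpha) { this.alpha = alpha; }

        // observed: latest measured utilization in [0, 1];
        // returns the forecast for the next monitoring interval.
        public double predictNext(double observed) {
            estimate = (estimate < 0) ? observed : alpha * observed + (1 - alpha) * estimate;
            return estimate;
        }
    }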
51.Identity-Based Encryption with Cloud Revocation Authority and Its Applications
ABSTRACT
Identity-based encryption (IBE) is a public key cryptosystem that eliminates the demands of a public key infrastructure (PKI) and certificate administration in conventional public key settings. Due to the absence of a PKI, the revocation problem is a critical issue in IBE settings. Several revocable IBE schemes have been proposed regarding this issue. Quite recently, by embedding an outsourcing computation technique into IBE, Li et al. proposed a revocable IBE scheme with a key-update cloud service provider (KU-CSP). However, their scheme has two shortcomings. One is that the computation and communication costs are higher than in previous revocable IBE schemes. The other is a lack of scalability, in the sense that the KU-CSP must keep a secret value for each user. In this article, we propose a new revocable IBE scheme with a cloud revocation authority (CRA) to solve the two shortcomings: the performance is significantly improved and the CRA holds only a system secret for all the users. For security analysis, we demonstrate that the proposed scheme is semantically secure under the decisional bilinear Diffie-Hellman (DBDH) assumption. Finally, we extend the proposed revocable IBE scheme to present a CRA-aided authentication scheme with period-limited privileges for managing a large number of various cloud services.
52.A Cloud Gaming System Based on User-Level Virtualization and Its Resource Scheduling
ABSTRACT
Many believe the future of gaming lies in the cloud, namely Cloud Gaming, which renders an interactive gaming application in the cloud and streams the scenes as a video sequence to the player over the Internet. This paper proposes GCloud, a GPU/CPU hybrid cluster for cloud gaming based on user-level virtualization technology. Specifically, we present a performance model to analyze the server capacity and games' resource consumption, which categorizes games into two types: CPU-critical and memory-I/O-critical. Consequently, several scheduling strategies have been proposed to improve resource utilization and compared with one another. Simulation tests show that both the First-Fit-like and the Best-Fit-like strategies outperform the others; in particular, they are near optimal in the batch-processing mode. Other test results indicate that GCloud is efficient: an off-the-shelf PC can support five high-end video games running at the same time. In addition, the average per-frame processing delay is 8~19 ms under different image resolutions, which outperforms other similar solutions.
53.Optimal Joint Scheduling and Cloud Offloading for Mobile Applications
ABSTRACT
Cloud offloading is an indispensable solution to supporting computationally demanding applications on resource-constrained mobile devices. In this paper, we introduce the concept of wireless-aware joint scheduling and computation offloading (JSCO) for multi-component applications, where an optimal decision is made on which components need to be offloaded as well as the scheduling order of these components. The JSCO approach allows for more degrees of freedom in the solution by moving away from a compiler-predetermined scheduling order for the components towards a more wireless-aware scheduling order. For some component dependency graph structures, the proposed algorithm can shorten execution times by parallel processing appropriate components in the mobile device and the cloud. We define a net utility that trades off the energy saved by the mobile device, subject to constraints on the communication delay, overall application execution time, and component precedence ordering. The linear optimization problem is solved using real data measurements obtained from running multi-component applications on an HTC smartphone and Amazon EC2, using WiFi for cloud offloading. The performance is further analyzed using various component dependency graph topologies and sizes. Results show that the energy saved increases with longer application runtime deadlines, higher wireless rates, and smaller offload data sizes.
54.An Efficient Privacy-Preserving Ranked Keyword Search Method
ABSTRACT
Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy preservation. It is therefore essential to develop efficient and reliable ciphertext search techniques. One challenge is that the relationships between documents are normally concealed in the process of encryption, which leads to significant degradation of search accuracy. Also, the volume of data in data centers has experienced dramatic growth. This makes it even more challenging to design ciphertext search schemes that can provide efficient and reliable online information retrieval over large volumes of encrypted data. In this paper, a hierarchical clustering method is proposed to support more search semantics and also to meet the demand for fast ciphertext search within a big data environment. The proposed hierarchical approach clusters the documents based on the minimum relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on the maximum size of a cluster is reached. In the search phase, this approach can reach a linear computational complexity against an exponential increase in the size of the document collection. In order to verify the authenticity of search results, a structure called the minimum hash sub-tree is designed in this paper. Experiments have been conducted using a collection set built from IEEE Xplore. The results show that with a sharp increase of documents in the dataset, the search time of the proposed method increases linearly whereas the search time of the traditional method increases exponentially. Furthermore, the proposed method has an advantage over the traditional method in the rank privacy and relevance of retrieved documents.
55.A Taxonomy of Job Scheduling on Distributed Computing Systems
ABSTRACT
Hundreds of papers on job scheduling for distributed systems are published every year and it becomes increasingly difficult to classify them. Our analysis revealed that half of these papers are barely cited. This paper presents a general taxonomy for scheduling problems and solutions in distributed systems. This taxonomy was used to classify and make publicly available the classification of 109 scheduling problems and their solutions. These 109 problems were further clustered into ten groups based on the features of the taxonomy. The proposed taxonomy will help researchers build on prior art, increase the visibility of new research, and minimize redundant effort.
56.LazyCtrl: A Scalable Hybrid Network Control Plane Design for Cloud Data Centers
ABSTRACT
The advent of software defined networking enables flexible, reliable and feature-rich control planes for data center networks. However, the tight coupling of centralized control and complete visibility leads to a wide range of issues, among which scalability has risen to prominence due to the excessive workload on the central controller. By analyzing the traffic patterns from a couple of production data centers, we observe that data center traffic is usually highly skewed and thus edge switches can be clustered into a set of communication-intensive groups according to traffic locality. Motivated by this observation, we present LazyCtrl, a novel hybrid control plane design for data center networks where network control is carried out by distributed control mechanisms inside independent groups of switches while complemented with a global controller. LazyCtrl aims at bringing laziness to the global controller by dynamically devolving most of the control tasks to independent switch groups to process frequent intra-group events near the datapath while handling rare inter-group or other specified events by the controller. We implement LazyCtrl and build a prototype based on Open vSwitch and Floodlight. Trace-driven experiments on our prototype show that an effective switch grouping is easy to maintain in multi-tenant clouds and the central controller can be significantly shielded by staying “lazy”, with its workload reduced by up to 82%.
57.Ensemble: A Tool for Performance Modeling of Applications in Cloud Data Centers
ABSTRACT
We introduce Ensemble, a runtime framework and associated tools for building application performance models on-the-fly. These dynamic performance models can be used to support complex, high-dimensional resource allocation, and/or what-if performance inquiry in modern heterogeneous environments, such as data centers and Clouds. Ensemble combines simple, partially specified, and lower-dimensionality models to provide good initial approximations for higher-dimensionality application performance models. We evaluated Ensemble on industry-standard and scientific applications. The results show that Ensemble provides accurate, fast, and flexible performance models even in the presence of significant environment variability.
58.AutoElastic: Automatic Resource Elasticity for High Performance Applications in the Cloud
ABSTRACT
Elasticity is undoubtedly one of the most striking characteristics of cloud computing. Especially in the area of high-performance computing (HPC), elasticity can be used to execute irregular and CPU-intensive applications. However, the on-the-fly increase/decrease in resources is more widespread in Web systems, which have their own IaaS-level load balancer. Considering the HPC area, current approaches usually focus on batch jobs or assumptions such as previous knowledge of application phases, source code rewriting or the stop-reconfigure-and-go approach to elasticity. In this context, this article presents AutoElastic, a PaaS-level elasticity model for HPC in the cloud. Its differential approach consists of providing elasticity for high-performance applications without user intervention or source code modification. The scientific contributions of AutoElastic are twofold: (i) an aging-based approach to resource allocation and deallocation actions to avoid unnecessary virtual machine (VM) reconfigurations (thrashing) and (ii) asynchronism in creating and terminating VMs in such a way that the application does not need to wait for these procedures to complete. The prototype evaluation using OpenNebula middleware showed performance gains of up to 26 percent in the execution time of an application with the AutoElastic manager. Moreover, we observed low intrusiveness for AutoElastic when reconfigurations do not occur.
59.Supporting Multi Data Stores Applications in Cloud Environments
ABSTRACT
The production of huge amounts of data and the emergence of cloud computing have introduced new requirements for data management. Many applications need to interact with several heterogeneous data stores depending on the type of data they have to manage: traditional data types, documents, graph data from social networks, simple key-value data, etc. Interacting with heterogeneous data models via different APIs imposes challenging tasks on the developers of multiple data store applications. Indeed, programmers have to be familiar with different APIs. In addition, the execution of complex queries over heterogeneous data models cannot currently be achieved in a declarative way, as is customary with single data store applications, and therefore requires extra implementation effort. Moreover, developers need to master and deal with the complex processes of cloud discovery and application deployment and execution. In this paper we propose an integrated set of models, algorithms and tools aimed at alleviating the developer's task of developing, deploying and migrating multiple data store applications in cloud environments. Our approach focuses mainly on three points. First, we provide a unifying data model used by application developers to interact with heterogeneous relational and NoSQL data stores. Based on that, they express queries using OPEN-PaaS-DataBase API (ODBAPI), a unique REST API allowing programmers to write their application code independently of the target data stores. Second, we propose virtual data stores, which act as mediators and interact with integrated data stores wrapped by ODBAPI. This run-time component supports the execution of single and complex queries over heterogeneous data stores. Finally, we present a declarative approach that lightens the burden of the tedious and non-standard tasks of (1) discovering relevant cloud environments and (2) deploying applications on them, while letting developers simply focus on specifying their storage and computing requirements. A prototype of the proposed solution has been developed and is currently used to implement use cases from the OpenPaaS project.
60.Coral: A Cloud-Backed Frugal File System
ABSTRACT
With simple access interfaces and flexible billing models, cloud storage has become an attractive solution to simplify storage management for both enterprises and individual users. However, traditional file systems with extensive optimizations for local disk-based storage backends cannot fully exploit the inherent features of the cloud to obtain desirable performance. In this paper, we present the design, implementation, and evaluation of Coral, a cloud-based file system that strikes a balance between performance and monetary cost. Unlike previous studies that treat cloud storage as just a normal backend of existing networked file systems, Coral is designed to address several key issues in optimizing cloud-based file systems, such as the data layout, block management, and billing model. With carefully designed data structures and algorithms, such as identifying semantically correlated data blocks, a kd-tree-based caching policy with self-adaptive thrashing prevention, effective data layout, and optimal garbage collection, Coral achieves good performance and cost savings under various workloads as demonstrated by extensive evaluations.
61.Dynamic and Public Auditing with Fair Arbitration for Cloud Data
ABSTRACT
Cloud users no longer physically possess their data, so how to ensure the integrity of their outsourced data becomes a challenging task. Recently proposed schemes such as “provable data possession” and “proofs of retrievability” are designed to address this problem, but they are designed to audit static archive data and therefore lack support for data dynamics. Moreover, threat models in these schemes usually assume an honest data owner and focus on detecting a dishonest cloud service provider, despite the fact that clients may also misbehave. This paper proposes a public auditing scheme with support for data dynamics and fair arbitration of potential disputes. In particular, we design an index switcher to eliminate the limitation of index usage in tag computation in current schemes and achieve efficient handling of data dynamics. To address the fairness problem so that no party can misbehave without being detected, we further extend existing threat models and adopt the signature-exchange idea to design fair arbitration protocols, so that any possible dispute can be fairly settled. The security analysis shows our scheme is provably secure, and the performance evaluation demonstrates that the overheads of data dynamics and dispute arbitration are reasonable.
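A minimal sketch of the index-switcher idea follows, under the assumption that it keeps a level of indirection between a block's logical position and the index used in its tag, so that insertions and deletions do not shift the tag indices of unaffected blocks. All names here are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical index switcher: maps logical block positions to stable,
    // never-reused tag indices, so a dynamic update to one block does not
    // force recomputation of other blocks' tags.
    public class IndexSwitcher {
        private final List<Long> tagIndexOf = new ArrayList<>(); // position -> tag index
        private long nextTagIndex = 0;

        public long tagIndexAt(int position) { return tagIndexOf.get(position); }

        // Insert a block at `position`; it receives a fresh tag index.
        public long insertBlock(int position) {
            long idx = nextTagIndex++;
            tagIndexOf.add(position, idx);
            return idx;
        }

        // Delete a block; the remaining blocks keep their tag indices.
        public void deleteBlock(int position) {
            tagIndexOf.remove(position);
        }
    }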
62.EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud
ABSTRACT
The explosive growth of data brings new challenges to data storage and management in the cloud environment. These data usually have to be processed in a timely fashion in the cloud; thus, any increased latency may cause a massive loss to the enterprises. Similarity detection plays a very important role in data management. Many typical algorithms such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA) are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring tremendous disk accesses. In addition, the overhead increases with the growth of the data set volume and results in long delays. Instead of reading the entire file, TSA samples some data blocks to calculate the fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the bit positions of the file content, so a failure of similarity identification is inevitable under such slight modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud by applying a modulo operation to the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and make the possible detection probability close to the actual probability. Furthermore, this paper describes a query algorithm to reduce the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead, CPU and memory occupation. Moreover, EPAS makes a more preferable tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
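The head-and-tail sampling idea can be pictured with the sketch below, which fingerprints a few fixed-size blocks from each end of a file. The block size, hash choice, and the assumption that the file holds at least 2 * blocksPerEnd blocks are ours, not EPAS's exact parameters.

    import java.io.RandomAccessFile;
    import java.security.MessageDigest;

    // Sketch of position-aware sampling: hash fixed-size blocks taken from
    // both the head and the tail of a file, so small edits that shift content
    // affect samples at one end only. Parameters are illustrative.
    public class HeadTailSampler {
        static final int BLOCK = 4096;

        // Assumes file length >= 2 * blocksPerEnd * BLOCK.
        public static byte[][] sample(String path, int blocksPerEnd) throws Exception {
            try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                MessageDigest sha = MessageDigest.getInstance("SHA-1");
                byte[][] fingerprints = new byte[2 * blocksPerEnd][];
                byte[] buf = new byte[BLOCK];
                long len = f.length();
                for (int i = 0; i < blocksPerEnd; i++) {
                    f.seek((long) i * BLOCK);             // i-th block from the head
                    f.readFully(buf);
                    fingerprints[i] = sha.digest(buf);
                    f.seek(len - (long) (i + 1) * BLOCK); // i-th block from the tail
                    f.readFully(buf);
                    fingerprints[blocksPerEnd + i] = sha.digest(buf);
                }
                return fingerprints;
            }
        }
    }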
63.TMACS: A Robust and Verifiable Threshold Multi-Authority Access Control System in Public Cloud Storage
ABSTRACT
Attribute-based Encryption (ABE) is regarded as a promising cryptographic tool to guarantee data owners' direct control over their data in public cloud storage. The earlier ABE schemes involve only one authority to maintain the whole attribute set, which can create a single-point bottleneck in both security and performance. Subsequently, some multi-authority schemes were proposed, in which multiple authorities separately maintain disjoint attribute subsets. However, the single-point bottleneck problem remains unsolved. In this paper, from another perspective, we propose a threshold multi-authority CP-ABE access control scheme for public cloud storage, named TMACS, in which multiple authorities jointly manage a uniform attribute set. In TMACS, taking advantage of (t, n) threshold secret sharing, the master key can be shared among multiple authorities, and a legal user can generate his/her secret key by interacting with any t authorities. Security and performance analysis results show that TMACS is not only verifiably secure when fewer than t authorities are compromised, but also robust when no fewer than t authorities are alive in the system. Furthermore, by efficiently combining the traditional multi-authority scheme with TMACS, we construct a hybrid scheme, which satisfies the scenario of attributes coming from different authorities as well as achieving security and system-level robustness.
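The (t, n) threshold sharing that TMACS relies on can be illustrated with textbook Shamir secret sharing over a prime field; the sketch below generates the n shares (reconstruction uses Lagrange interpolation at zero). This is the generic primitive, not the paper's concrete construction.

    import java.math.BigInteger;
    import java.security.SecureRandom;

    // Textbook Shamir (t, n) sharing: the secret is the constant term of a
    // random degree-(t-1) polynomial over GF(p); authority i gets (i, f(i)).
    public class ShamirSharing {
        public static BigInteger[][] share(BigInteger secret, int t, int n,
                                           BigInteger p, SecureRandom rnd) {
            BigInteger[] coeff = new BigInteger[t];
            coeff[0] = secret; // f(0) = master key
            for (int j = 1; j < t; j++) {
                coeff[j] = new BigInteger(p.bitLength() - 1, rnd); // random coefficient
            }
            BigInteger[][] shares = new BigInteger[n][2];
            for (int i = 1; i <= n; i++) {
                BigInteger x = BigInteger.valueOf(i), y = BigInteger.ZERO;
                for (int j = t - 1; j >= 0; j--) { // Horner evaluation of f(x) mod p
                    y = y.multiply(x).add(coeff[j]).mod(p);
                }
                shares[i - 1][0] = x;
                shares[i - 1][1] = y;
            }
            return shares; // any t shares recover f(0) by Lagrange interpolation
        }
    }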
64.Risk Assessment in a Sensor Cloud Framework Using Attack Graphs
ABSTRACT
A sensor cloud consists of various heterogeneous wireless sensor networks (WSNs). These WSNs may have different owners and run a wide variety of user applications on demand in a wireless communication medium. Hence, they are susceptible to various security attacks. Thus, a need exists to formulate effective and efficient security measures that safeguard applications in the sensor cloud from the impact of attacks. However, analyzing the impact of different attacks and their cause-consequence relationships is a prerequisite before security measures can be either developed or deployed. In this paper, we propose a risk assessment framework for WSNs in a sensor cloud that utilizes attack graphs. We use Bayesian networks not only to assess but also to analyze attacks on WSNs. The risk assessment framework first reviews the impact of attacks on a WSN and estimates reasonable time frames that predict the degradation of WSN security parameters like confidentiality, integrity and availability. Using our proposed risk assessment framework allows the security administrator to better understand the threats present and take necessary actions against them. The framework is validated by comparing the assessment results with the results obtained from different simulated attack scenarios.
65.RepCloud: Attesting to Cloud Service Dependency
ABSTRACT
Security enhancements to the emerging IaaS (Infrastructure as a Service) cloud computing systems have become the focus of much research, but little of this targets the underlying infrastructure. Trusted Cloud systems are proposed to integrate Trusted Computing infrastructure with cloud systems. With remote attestations, cloud customers are able to determine the genuine behaviors of their applications' hosts, and therefore they establish trust in the cloud. However, the current Trusted Clouds have difficulties in effectively attesting to the cloud service dependency for customers' applications, due to the cloud's complexity, heterogeneity and dynamism. In this paper, we present RepCloud, a decentralized cloud trust management framework inspired by the reputation systems from research in peer-to-peer systems. With RepCloud, cloud customers are able to determine the properties of the exact nodes that may affect the genuine functionalities of their applications, without obtaining much internal information about the cloud. Experiments showed that besides achieving fine-grained cloud service dependency attestation, RepCloud incurred lower trust management overhead than the existing trusted cloud systems.
66.Poris: A Scheduler for Parallel Soft Real-Time Applications in Virtualized Environments
ABSTRACT
With the prevalence of cloud computing and virtualization, more and more cloud services, including parallel soft real-time applications (PSRT applications), are running in virtualized data centers. However, current hypervisors do not provide adequate support for them because of soft real-time constraints and synchronization problems, which result in frequent deadline misses and serious performance degradation. CPU schedulers in the underlying hypervisors are central to these issues. In this paper, we identify and analyze CPU scheduling problems in hypervisors. Then, according to this analysis, we design and implement a parallel soft real-time scheduler based on Xen, named Poris. It addresses both soft real-time constraints and synchronization problems simultaneously. In our proposed method, priority promotion and dynamic time slice mechanisms are introduced to determine when to schedule virtual CPUs (VCPUs) according to the characteristics of soft real-time applications. Moreover, considering that PSRT applications may run in a single virtual machine (VM) or across multiple VMs, we present parallel scheduling, group scheduling and communication-driven group scheduling to accelerate synchronization of these applications and ensure that tasks are finished before their deadlines under different scenarios. Our evaluation shows Poris can significantly improve the performance of PSRT applications whether they run in a single VM or multiple VMs. For example, compared to the Credit scheduler, Poris decreases the response time of a web search benchmark by up to 91.6 percent.
67.Cost Minimization Algorithms for Data Center Management
ABSTRACT
Due to the increasing usage of cloud computing applications, it is important to minimize the energy cost consumed by a data center and, simultaneously, to improve the quality of service via data center management. One promising approach is to switch some servers in a data center to the idle mode for saving energy while keeping a suitable number of servers in the active mode to provide timely service. In this paper, we design both online and offline algorithms for this problem. For the offline algorithm, we formulate data center management as a cost minimization problem by considering energy cost, delay cost (to measure service quality), and switching cost (to change the servers' active/idle mode). Then, we analyze certain properties of an optimal solution which lead to a dynamic programming based algorithm. Moreover, by revising the solution procedure, we successfully eliminate the recursive procedure and achieve an optimal offline algorithm with polynomial complexity. For the online algorithm, we design it by considering the worst-case scenario for future workload. In simulation, we show that this online algorithm can always provide near-optimal solutions.
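The offline dynamic program can be pictured as follows: let dp[t][m] be the minimum cost of ending time slot t with m active servers; each transition pays energy for the active servers, a delay penalty that grows as capacity approaches the workload, and a switching cost proportional to the change in mode. The cost functions in this sketch are placeholder assumptions, not the paper's exact models.

    // DP sketch over time slots; cost terms are illustrative placeholders.
    public class ServerDP {
        public static double minCost(double[] load, int maxServers,
                                     double energyCost, double switchCost) {
            double[] prev = new double[maxServers + 1]; // dp values for previous slot
            double[] cur = new double[maxServers + 1];
            for (int t = 0; t < load.length; t++) {
                for (int m = 0; m <= maxServers; m++) {
                    double best = Double.MAX_VALUE;
                    for (int mPrev = 0; mPrev <= maxServers; mPrev++) {
                        double c = prev[mPrev]
                                 + switchCost * Math.abs(m - mPrev) // mode changes
                                 + energyCost * m                   // active servers
                                 + delay(load[t], m);               // service quality
                        if (c < best) best = c;
                    }
                    cur[m] = best;
                }
                double[] tmp = prev; prev = cur; cur = tmp;
            }
            double ans = Double.MAX_VALUE;
            for (double v : prev) ans = Math.min(ans, v);
            return ans;
        }

        // Illustrative M/M/1-style delay penalty; effectively infinite if overloaded.
        static double delay(double load, int m) {
            if (load == 0) return 0;
            return (m <= load) ? 1e18 : load / (m - load);
        }
    }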
68.K Nearest Neighbour Joins for Big Data on MapReduce: a Theoretical and Experimental Analysis
ABSTRACT
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a large range of applications such as knowledge discovery or data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in a reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the different existing approaches for computing kNN on MapReduce, first theoretically, and then by performing an extensive experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze each step from the load balancing, accuracy and complexity aspects. Experiments in this paper use a variety of datasets, and analyze the impact of data volume, data dimension and the value of k from many perspectives like time and space complexity, and accuracy. The experimental part reveals advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.
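For reference, the per-partition computation step that all the compared approaches ultimately perform can be as simple as the bounded-heap sketch below; the pre-processing and partitioning strategies the paper actually compares are deliberately omitted here.

    import java.util.PriorityQueue;

    // Core kNN step on one data partition: keep the k closest points to the
    // query in a max-heap keyed on distance, evicting the farthest point.
    public class KnnCore {
        public static double[][] kNearest(double[] p, double[][] partition, int k) {
            PriorityQueue<double[]> heap = new PriorityQueue<>(
                (a, b) -> Double.compare(dist(b, p), dist(a, p))); // max-heap by distance
            for (double[] s : partition) {
                heap.offer(s);
                if (heap.size() > k) heap.poll(); // drop the current farthest
            }
            return heap.toArray(new double[0][]);
        }

        // Squared Euclidean distance (sufficient for comparisons).
        static double dist(double[] a, double[] b) {
            double sum = 0;
            for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
            return sum;
        }
    }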
69.Efficient Algorithms for Mining Top-K High Utility Itemsets
ABSTRACT
Mining high utility itemsets (HUIs) is an emerging topic in data mining, which refers to discovering all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However, setting min_util appropriately is a difficult problem for users. Generally speaking, finding an appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set too low, too many HUIs will be generated, which may cause the mining process to be very inefficient. On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we address the above issues by proposing a new framework for top-k high utility itemset mining, where k is the desired number of HUIs to be mined. Two types of efficient algorithms, named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase), are proposed for mining such itemsets without the need to set min_util. We provide a structural comparison of the two algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal case of state-of-the-art utility mining algorithms.
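The common trick behind such top-k mining, replacing the user-set min_util with an internal border that rises as better itemsets are found, can be sketched generically as below; the TKU/TKO pruning strategies themselves are not shown.

    import java.util.PriorityQueue;

    // Generic top-k border maintenance: a min-heap of the k highest utilities
    // seen so far; once full, its minimum becomes the pruning threshold.
    public class TopKBorder {
        private final int k;
        private final PriorityQueue<Long> topUtilities = new PriorityQueue<>();

        public TopKBorder(int k) { this.k = k; }

        public void offer(long utility) {
            if (topUtilities.size() < k) {
                topUtilities.offer(utility);
            } else if (utility > topUtilities.peek()) {
                topUtilities.poll();            // evict the current k-th best
                topUtilities.offer(utility);
            }
        }

        // Candidates whose utility upper bound falls below this can be pruned.
        public long border() {
            return topUtilities.size() < k ? 0L : topUtilities.peek();
        }
    }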
70.Mining User-Aware Rare Sequential Topic Patterns in Document Streams
ABSTRACT
Textual documents created and distributed on the Internet are constantly changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviours of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, so they can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining problem through three phases: pre-processing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by performing user-aware rarity analysis on the derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover special users and interpretable URSTPs effectively and efficiently, which significantly reflect users' characteristics.
71.Pattern Based Sequence Classification
ABSTRACT
Sequence classification is an important task in data mining. We address the problem of sequence classification using rules composed of interesting patterns found in a dataset of labelled sequences and accompanying class labels. We measure the interestingness of a pattern in a given class of sequences by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. The first classifier is based on an improved version of the existing method of classification based on association rules, while the second ranks the rules by first measuring their value specific to the new data object. Experimental results show that our rule-based classifiers outperform existing comparable classifiers in terms of accuracy and stability. Additionally, we test a number of pattern-feature-based models that use different kinds of patterns as features to represent each sequence as a feature vector. We then apply a variety of machine learning algorithms for sequence classification, experimentally demonstrating that the patterns we discover represent the sequences well, and prove effective for the classification task.
72.ATD: Anomalous Topic Discovery in High Dimensional Discrete Data
ABSTRACT
We propose an algorithm for detecting patterns exhibited by anomalous clusters in high-dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points which collectively exhibit abnormal patterns. In many applications this can lead to a better understanding of the nature of the atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case where the atypical patterns manifest in only a small (salient) subset of the very high-dimensional feature space. Individual AD techniques and techniques that detect anomalies using all the features typically fail to detect such anomalies, but our method can detect such instances collectively, discover the shared anomalous patterns they exhibit, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Results of our experiments show that our method can accurately detect anomalous topics and salient features (words) under each such topic in a synthetic data set and two real-world text corpora, and achieves better performance compared to both standard group AD and individual AD techniques.
73.Crowdsourced Data Management: A Survey
ABSTRACT
Some important data management and analytics tasks cannot be completely addressed by automated processes. These “computer-hard” tasks, such as entity resolution, sentiment analysis, and image recognition, can be enhanced through the use of human cognitive ability. Human computation is an effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd). Thus, crowdsourced data management has become an area of increasing interest in research and industry. There are three important problems in crowdsourced data management. (1) Quality Control: workers may return noisy results, and effective techniques are required to achieve high quality. (2) Cost Control: the crowd is not free, and cost control aims to reduce the monetary cost. (3) Latency Control: human workers can be slow, particularly in contrast to computing time scales, so latency-control techniques are required. There has been significant work addressing these three factors in designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing plans composed of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies on crowdsourced data management. Based on this analysis, we then outline key factors that need to be considered to improve crowdsourced data management.
74.A Survey of General-Purpose Crowdsourcing Techniques
ABSTRACT
Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving paradigm has gained a lot of attention and has been a hot research topic in the field of Computer Science. Even though a lot of work has been conducted on this topic, so far we do not have a comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to offer an overall picture of the current state-of-the-art techniques in general-purpose crowdsourcing. According to their focus, we divide this work into three parts: incentive design, task assignment and quality control. For each part, we start with the different problems faced in that area, followed by a brief description of existing work and a discussion of pros and cons. In addition, we also present a real scenario of how the different techniques are used in implementing a location-based crowdsourcing platform, gMission. Finally, we highlight the limitations of current general-purpose crowdsourcing techniques and present some open problems in this area.
75.Mining Health Examination Records — A Graph-based Approach
ABSTRACT
General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. Particularly, the unlabeled data describes the participants in health examinations whose health conditions can vary greatly from healthy to very ill. There is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
76.TopicSketch: Real-time Bursty Topic Detection from Twitter
ABSTRACT
Twitter has become one of the largest microblogging platforms for users around the world to share anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers a surge of relevant tweets within a short period of time, which often reflects important events of mass interest. How to leverage Twitter for early detection of bursty topics has therefore become an important research problem with immense practical value. Despite the wealth of research work on topic modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real time. As existing methods can hardly scale to handle the task on the tweet stream in real time, we propose in this paper TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experimental results show both the efficiency and effectiveness of our approach. In particular, we demonstrate that TopicSketch on a single machine can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets in Twitter, and present bursty events at a finer granularity.
77.SPIRIT: A Tree Kernel-based Method for Topic Person Interaction Detection
ABSTRACT
The development of a topic in a set of topic documents is constituted by a series of person interactions at a specific time and place. Knowing the interactions of the persons mentioned in these documents helps readers better comprehend the documents. In this paper, we propose a topic person interaction detection method called SPIRIT, which classifies the text segments in a set of topic documents that convey person interactions. We design a rich interactive tree structure to represent the syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based convolution kernel to identify interactive segments. Experimental results based on real-world topics demonstrate that the proposed rich interactive tree structure effectively detects topic person interactions and that our method outperforms many well-known relation extraction and protein-protein interaction methods.
78.Truth Discovery in Crowdsourced Detection of Spatial Events
ABSTRACT
The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the detection of spatial events when smartphone users move around in their daily lives. However, the credibility of those detected events can be negatively impacted by unreliable participants with low-quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to discover true events from diverse and noisy participants’ reports. This problem is uniquely distinct from its online counterpart in that it involves uncertainties in both participants’ mobility and reliability. Decoupling these two types of uncertainties through location tracking will raise severe privacy and energy issues, whereas simply ignoring missing reports or treating them as negative reports will significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events (PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths of events, and three-way participant reliability in a unified framework. In PTSE, we further model personal location visit tendencies. These proposed models are capable of effectively handling various types of uncertainties and automatically discovering truths without any supervision or location tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing environment.
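The interplay between participant reliability and event truth can be pictured with the generic iterative weighted-voting sketch below. This is a deliberate simplification: TSE and PTSE additionally model location popularity, visit indicators, and three-way reliability, none of which appear here.

    // Generic iterative truth discovery sketch: alternate between inferring
    // event truths from reliability-weighted votes and re-estimating each
    // participant's reliability from agreement with those truths.
    // reports[i][e]: +1 = event reported, -1 = reported absent, 0 = no report.
    public class IterativeTruthDiscovery {
        public static boolean[] discover(int[][] reports, int iterations) {
            int nP = reports.length, nE = reports[0].length;
            double[] reliability = new double[nP];
            java.util.Arrays.fill(reliability, 0.8); // uniform prior
            boolean[] truth = new boolean[nE];
            for (int it = 0; it < iterations; it++) {
                for (int e = 0; e < nE; e++) {        // truth step
                    double score = 0;
                    for (int i = 0; i < nP; i++) score += reliability[i] * reports[i][e];
                    truth[e] = score > 0;
                }
                for (int i = 0; i < nP; i++) {        // reliability step
                    int agree = 0, total = 0;
                    for (int e = 0; e < nE; e++) {
                        if (reports[i][e] == 0) continue; // missing reports ignored
                        total++;
                        if ((reports[i][e] > 0) == truth[e]) agree++;
                    }
                    reliability[i] = total == 0 ? 0.5 : (double) agree / total;
                }
            }
            return truth;
        }
    }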
79.Graph Regularized Feature Selection with Data Reconstruction
ABSTRACT
Feature selection is a challenging problem for high dimensional data processing, which arises in many real applications such as data mining, information retrieval, and pattern recognition. In this paper, we study the problem of unsupervised feature selection. The problem is challenging due to the lack of label information to guide feature selection. We formulate the problem of unsupervised feature selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the selected features not only preserve the local structure of the original data space via graph regularization, but also approximately reconstruct each data point via linear combination. Therefore, the graph regularized data reconstruction error becomes a natural criterion for measuring the quality of the selected features. By minimizing the reconstruction error, we are able to select the features that best preserve both the similarity and discriminant information in the original data. We then develop an efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the performance of our proposed algorithm on text clustering. The extensive experiments demonstrate the effectiveness of our proposed approach.
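In standard notation, objectives of this kind are often written as follows; since the abstract does not give the exact formula, this is our hedged reconstruction rather than the paper's own:

    \min_{W}\; \|X - XW\|_F^2 \;+\; \alpha\, \mathrm{tr}\big((XW)^{\top} L\, (XW)\big) \;+\; \beta\, \|W\|_{2,1}

Here X is the data matrix, W the reconstruction matrix whose nonzero rows mark selected features, L the graph Laplacian encoding local structure, and the l2,1-norm term induces row sparsity so that only a small subset of features is retained.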
80.Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks
ABSTRACT
The last few years have witnessed the emergence and evolution of a vibrant research stream on a large variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical users among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help solve many problems in social computing in both theory and applications. Since public profiles can be duplicated and easily impersonated by users with different purposes, most current user identification resolutions, which mainly focus on text mining of users’ public profiles, are fragile. Some studies have attempted to match users based on the location and timing of user content as well as writing style. However, the locations are sparse in the majority of SMNs, and writing style is difficult to discern from the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online SMNs are quite symmetric, existing user identification schemes based on network structure are not effective. The real-world friend cycle is highly individual and virtually no two users share a congruent friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform SMNs. Since identical users tend to set up partial similar friendship structures in different SMNs, we proposed the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are considered as identical users. We also developed two propositions to improve the efficiency of the algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current network structure-based algorithms.
81.TaxoFinder: A Graph-Based Approach for Taxonomy Learning
ABSTRACT
Taxonomy learning is an important task for knowledge acquisition, sharing, and classification as well as application development and utilization in various domains. To reduce the human effort needed to build a taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy. First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph representing how such concepts are associated together based on their co-occurrences. As the key method in TaxoFinder, we propose a method for measuring associative strengths among the concepts, which quantify how strongly they are associated in the graph, using similarities and spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a graph analytic algorithm, aiming to maximize the overall associative strengths among the concepts in the graph. We evaluate TaxoFinder using gold-standard evaluation on three different domains: emergency management for mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach significantly outperforming the subsumption method.
82.NATERGM: A Model for Examining the Role of Nodal Attributes in Dynamic Social Media Networks
ABSTRACT
Social media networks are dynamic. As such, the order in which network ties develop is an important aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network analysis. The proposed model focuses on how the nodal attributes of a network affect the order in which the network ties develop. Temporal patterns in social media networks are modeled based on the nodal attributes of individuals and the time information of network ties. Using social media data collected from a knowledge sharing community, empirical tests were conducted to evaluate the performance of the NATERGM on identifying the temporal patterns and predicting the characteristics of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing capability and an increased prediction accuracy of network characteristics compared to benchmark models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation process of dynamic networks.
83.Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification
ABSTRACT
Graph classification aims to learn models to classify structured data. To date, all existing graph classification methods are designed to target one single learning task and require a large number of labeled samples for learning good classification models. In reality, each real-world task may only have a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG) problem, where multiple graph classification tasks are jointly regularized to find discriminative subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that, with a limited number of training samples, subgraph features selected for one single graph classification task tend to overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize multiple tasks to explore high-quality subgraph features for graph classification. To achieve this goal, we formulate an objective function which combines multiple graph classification tasks to evaluate the informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task learning process is further proposed to incrementally select subgraph features for graph classification. Experiments on real-world multi-task graph classification datasets demonstrate significant performance gains.
84.Mining Health Examination Records — A Graph-based Approach
ABSTRACT
General health examination is an integral part of healthcare in many countries. Identifying the participants at risk is important for early warning and preventive intervention. The fundamental challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected dataset. In particular, the unlabeled data describes participants in health examinations whose health conditions can vary greatly from healthy to very ill, and there is no ground truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk prediction, classifying a progressively developing situation in which the majority of the data is unlabeled. An efficient iterative algorithm is designed and a proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.
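While SHG-Health operates on a heterogeneous graph, its iterative flavor can be conveyed with the classic label-propagation update F <- alpha*S*F + (1-alpha)*Y, which converges for 0 < alpha < 1; the sketch below is that generic iteration in Java, not the paper's algorithm. S is a normalized affinity matrix and Y encodes the few labeled participants.

class LabelPropagation {
    static double[][] propagate(double[][] S, double[][] Y, double alpha, int iters) {
        int n = Y.length, c = Y[0].length;
        double[][] F = new double[n][c];
        for (int it = 0; it < iters; it++) {
            double[][] next = new double[n][c];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < c; k++) {
                    double acc = 0;
                    for (int j = 0; j < n; j++) acc += S[i][j] * F[j][k]; // spread labels
                    next[i][k] = alpha * acc + (1 - alpha) * Y[i][k];     // anchor to known labels
                }
            F = next;
        }
        return F;   // the largest entry in row i gives node i's predicted class
    }
}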
85.Semantic-Aware Blocking for Entity Resolution
ABSTRACT
In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently unifies both textual and semantic features into an ER blocking process. In order to understand how similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces. In doing so, the proposed framework can support efficient similarity searches on records in both textual and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking techniques. Our experimental study shows that the combination of semantic similarity and textual similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking for performing entity resolution tasks in a large-scale data environment.
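As an illustration of the blocking step, the sketch below uses MinHash, a standard LSH family for Jaccard similarity (the paper integrates several similarity spaces; the banding scheme here is a common textbook construction, not the paper's exact design). Records whose signatures collide on some band share a block and become comparison candidates.

import java.util.*;

class MinHashBlocking {
    static int[] signature(Set<Integer> tokens, int numHashes, long seed) {
        int[] sig = new int[numHashes];
        Arrays.fill(sig, Integer.MAX_VALUE);
        Random rnd = new Random(seed);
        for (int h = 0; h < numHashes; h++) {
            int a = rnd.nextInt() | 1, b = rnd.nextInt();  // random affine hash
            for (int t : tokens)
                sig[h] = Math.min(sig[h], a * t + b);      // keep the minimum hash
        }
        return sig;
    }
    // block key for one band; records with equal keys land in the same block
    static String bandKey(int[] sig, int band, int rowsPerBand) {
        StringBuilder sb = new StringBuilder();
        for (int r = 0; r < rowsPerBand; r++)
            sb.append(sig[band * rowsPerBand + r]).append(':');
        return sb.toString();
    }
}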
86.On Learning of Choice Models with Interactive Attributes
ABSTRACT
Bringing recent advances in machine learning to state-of-the-art discrete choice models, we develop an approach to infer the unique and complex decision-making process of a decision-maker (DM), characterized by, among other factors, the DM's priorities, attitudinal character, and attribute interactions. On the basis of exemplary preference information in the form of pairwise comparisons of alternatives, our method seeks to induce a DM's preference model in terms of the parameters of recent discrete choice models. To this end, we reduce the learning task to a constrained non-linear optimization problem. Our learning approach is simple, yet it takes into account the interactions among attributes along with the priorities and the unique attitudinal character of a DM. Experimental results on standard benchmark datasets suggest that our approach is not only intuitively appealing and easily interpretable but also competitive with state-of-the-art methods.
87.OMASS: One Memory Access Set Separation
ABSTRACT
In many applications, there is a need to identify to which of a group of sets an element x belongs, if any. For example, in a router, this functionality can be used to determine the next hop of an incoming packet. This problem is generally known as set separation and has been widely studied. Most existing solutions make use of hash-based algorithms, particularly when a small percentage of false positives is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can require several memory accesses, a significant limitation for some implementations. We propose an approach using Block Bloom Filters, where each element is first hashed to a single memory block that stores a small Bloom filter tracking the element and the set or sets the element belongs to. In a naïve solution, storing an element x of a set S necessarily increases the false positive probability of finding x in another set T. In this paper, we introduce our One Memory Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that, for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates for the Bloom filters for element x under other sets are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution.
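The word-layout idea can be made concrete with a small Java sketch (field sizes, hash mixing constants, and parameters below are illustrative assumptions, not the paper's design): each element maps to one 64-bit word, and each set owns a disjoint range of bit positions inside that word, so inserting x under set S never touches the positions checked for x under set T.

class OmassSketch {
    static final int SETS = 4, BITS_PER_SET = 16, K = 2;   // K bits set per element/set
    long[] words = new long[1 << 20];                      // one 64-bit word per block

    void insert(int x, int set) {
        long w = words[block(x)];
        for (int i = 0; i < K; i++) w |= 1L << pos(x, set, i);
        words[block(x)] = w;                               // single memory word updated
    }
    boolean mightContain(int x, int set) {
        long w = words[block(x)];                          // single memory access
        for (int i = 0; i < K; i++)
            if ((w & (1L << pos(x, set, i))) == 0) return false;
        return true;                                       // true may be a false positive
    }
    private int block(int x) { return (x * 0x9E3779B9) >>> 12; }   // 20-bit block index
    private int pos(int x, int set, int i) {
        int h = (x * 31 + i * 0x85EBCA6B) >>> 28;          // 4 bits: 0..15
        return set * BITS_PER_SET + h;                     // stay inside set's own field
    }
}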
88.Resolving Multi-party Privacy Conflicts in Social Media
ABSTRACT
Items shared through Social Media may affect more than one user's privacy, e.g., photos that depict multiple users, comments that mention multiple users, events to which multiple users are invited, etc. The lack of multi-party privacy management support in current mainstream Social Media infrastructures leaves users unable to appropriately control to whom these items are actually shared. Computational mechanisms that are able to merge the privacy preferences of multiple users into a single policy for an item can help solve this problem. However, merging multiple users' privacy preferences is not an easy task, because privacy preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict in order to propose solutions that are acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only consider fixed ways of aggregating privacy preferences. In this paper, we propose the first computational mechanism to resolve conflicts for multi-party privacy management in Social Media that is able to adapt to different situations by modelling the concessions that users make to reach a solution to the conflicts. We also present results of a user study in which our proposed mechanism outperformed other existing approaches in terms of how often each approach matched users' behaviour.
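As a toy illustration of concession-based merging (an invented rule for exposition; the paper's mechanism models concessions in much richer detail), one can let co-owners for whom the item has low sensitivity concede, and let the remaining firm preferences decide:

class PrivacyConflict {
    // prefs[u]: true if co-owner u would grant this viewer access;
    // sensitivity[u]: how sensitive the item is to u, in [0,1]
    static boolean resolve(boolean[] prefs, double[] sensitivity, double threshold) {
        int grant = 0, deny = 0;
        for (int u = 0; u < prefs.length; u++) {
            if (!prefs[u] && sensitivity[u] < threshold) continue; // u concedes
            if (prefs[u]) grant++; else deny++;
        }
        return deny == 0 || grant > deny;  // firm denials can still outvote grants
    }
}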
89.SEDEX: Scalable Entity Preserving Data Exchange
ABSTRACT
Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that source data is reflected in the target. Generally, data exchange is performed using schema mapping, which represents high-level relations between source and target schemas. In this paper, we argue that data exchange based solely on schema-level information limits the ability to express semantics in data exchange. We show that such schema-level mappings may not only result in entity fragmentation but also fail to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show that SEDEX outperforms other methods in terms of quality and scalability of data exchange.
90.DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud
ABSTRACT
Despite recent advances in distributed RDF data management, processing large amounts of RDF data in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes rich and complex graphs mixing both instance and schema-level data. Sharding such data using classical techniques or partitioning the graph using traditional min-cut algorithms leads to very inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud, an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous approaches, DiploCloud runs a physiological analysis of both instance and schema information prior to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data structures, as well as the new algorithms we use to partition and distribute data. We also present an extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster than state-of-the-art systems on standard workloads.
91.A Survey on Trajectory Data Mining: Techniques and Applications
ABSTRACT
Rapid advances in location acquisition technologies have boosted the generation of trajectory data, which track the traces of moving objects. A trajectory is typically represented by a sequence of timestamped geographical locations. A wide spectrum of applications can benefit from trajectory data mining. While bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this paper, we survey various applications of trajectory data mining, e.g., path discovery, location prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive collection of existing trajectory data mining techniques and discusses them within a framework of trajectory data mining. This framework and the survey can be used as a guideline for designing future trajectory data mining solutions.
92.Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems
ABSTRACT
In this paper, we consider a new insider threat to the privacy-preserving work of distributed kernel-based data mining (DKBDM), such as distributed support vector machines. Among several known data breaching problems, those associated with insider attacks have been rising significantly, making this one of the fastest growing types of security breaches. Once considered a negligible concern, insider attacks have risen to be one of the top three central data violations. Insider-related research involving the distribution of kernel-based data mining is limited, resulting in substantial vulnerabilities in designing protection against collaborating organizations. Prior work often addresses a multifactorial model that is limited in scope and implementation, rather than the case of insiders within an organization colluding with outsiders. A faulty system allows such collusion to go unnoticed when an insider shares data with an outsider, who can then recover the original data from message transmissions (intermediary kernel values) among organizations. This attack requires access to only a few data entries within the organizations, rather than the encrypted administrative privileges typically found in distributed data mining scenarios. To the best of our knowledge, we are the first to explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of insider data necessary to launch the attack. Finally, we introduce several privacy-preserving schemes to counter the described attack.
93.Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
ABSTRACT
Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas, such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithm achieves speedups of up to P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is very universal: it can be used for static load-balancing of other pattern mining algorithms, such as itemset/tree/graph mining algorithms.
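Once per-partition running times have been estimated probabilistically, the remaining step is a classic scheduling problem; the Java sketch below balances the estimated costs across processors with the greedy longest-processing-time rule (a standard heuristic used here for illustration, not necessarily the paper's exact assignment strategy).

import java.util.*;

class StaticBalance {
    // costs[i]: estimated mining time of work unit i; returns owner processor of each unit
    static int[] assign(double[] costs, int processors) {
        Integer[] order = new Integer[costs.length];
        for (int i = 0; i < costs.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(costs[b], costs[a])); // heaviest first
        double[] load = new double[processors];
        int[] owner = new int[costs.length];
        for (int idx : order) {
            int p = 0;                                     // currently lightest processor
            for (int q = 1; q < processors; q++) if (load[q] < load[p]) p = q;
            owner[idx] = p;
            load[p] += costs[idx];
        }
        return owner;
    }
}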
94.Clustering Data Streams Based on Shared Density between Micro-Clusters
ABSTRACT
As more and more applications produce streaming data, clustering data streams has become an important technique for data and knowledge engineering. A typical approach is to summarize the data stream in real time with an online process into a large number of so-called micro-clusters. Micro-clusters represent local density estimates by aggregating the information of many data points in a defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-clusters are used as pseudo-points, with the density estimates used as their weights. However, information about density in the area between micro-clusters is not preserved in the online process, and reclustering is based on possibly inaccurate assumptions about the distribution of data within and between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-cluster-based online clustering component that explicitly captures the density between micro-clusters via a shared density graph. The density information in this graph is then exploited for reclustering based on actual density between adjacent micro-clusters. We discuss the space and time complexity of maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets highlight that using shared density improves clustering quality over other popular data stream clustering methods, which require the creation of a larger number of smaller micro-clusters to achieve comparable results.
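The bookkeeping behind a shared density graph can be sketched compactly (a simplification with a fixed radius and no decay, unlike the full DBSTREAM algorithm): every micro-cluster a point falls into gains weight, and every pair of micro-clusters it falls into simultaneously gains shared density, which reclustering later uses to decide whether adjacent micro-clusters belong to the same final cluster.

import java.util.*;

class SharedDensitySketch {
    List<double[]> centers = new ArrayList<>();
    List<Double> weight = new ArrayList<>();
    Map<Long, Double> shared = new HashMap<>();     // shared density per micro-cluster pair
    double r = 1.0;                                 // fixed micro-cluster radius

    void insert(double[] x) {
        List<Integer> hit = new ArrayList<>();
        for (int i = 0; i < centers.size(); i++)
            if (dist(centers.get(i), x) <= r) hit.add(i);
        if (hit.isEmpty()) { centers.add(x.clone()); weight.add(1.0); return; }
        for (int i : hit) weight.set(i, weight.get(i) + 1);
        for (int a = 0; a < hit.size(); a++)        // point lies in an overlap region
            for (int b = a + 1; b < hit.size(); b++)
                shared.merge(key(hit.get(a), hit.get(b)), 1.0, Double::sum);
    }
    static long key(int i, int j) { return ((long) Math.min(i, j) << 32) | Math.max(i, j); }
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) s += (a[t] - b[t]) * (a[t] - b[t]);
        return Math.sqrt(s);
    }
}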
95.FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
ABSTRACT
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to this problem, we design a parallel frequent itemset mining algorithm called FiDoop using the MapReduce programming model. To achieve compressed storage and avoid building conditional pattern bases, FiDoop incorporates the frequent-items ultrametric tree rather than conventional FP-trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial third MapReduce job, the mappers independently decompose itemsets and the reducers perform combination operations by constructing small ultrametric trees, which are then mined separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data distribution and dimensions, because itemsets of different lengths have different decomposition and construction costs. To improve FiDoop's performance, we develop a workload balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD, an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis. Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution is efficient and scalable.
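FiDoop's three jobs are too large to reproduce here, but the MapReduce shape of the first one (counting item frequencies and pruning infrequent items) is easy to convey; the sketch below simulates the map and reduce phases with plain Java collections rather than the Hadoop API.

import java.util.*;

class FrequentItems {
    static Map<String, Integer> mineFrequent(List<String[]> transactions, int minSup) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tx : transactions)                 // "map": emit (item, 1) pairs
            for (String item : tx)
                counts.merge(item, 1, Integer::sum);     // "reduce": sum counts per item
        counts.values().removeIf(c -> c < minSup);       // prune infrequent items
        return counts;
    }
}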
96.Fast and accurate mining the community structure: integrating center locating and membership optimization
ABSTRACT
Mining communities or clusters in networks is valuable for analyzing, designing, and optimizing many natural and engineered complex systems, e.g., protein networks, power grids, and transportation systems. Most of the existing techniques view the community mining problem as an optimization problem based on a given quality function (e.g., modularity); however, none of them is grounded in a systematic theory for identifying the central nodes in the network. Moreover, how to reconcile mining efficiency with community quality remains an open problem. In this paper, we attempt to address these challenges by introducing a novel algorithm. First, a kernel function with a tunable influence factor is proposed to measure the leadership of each node; the nodes with the highest local leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system to describe the dynamical assignment of community membership, and formulate several conditions that guarantee the convergence of each node's dynamic trajectory, by which the hierarchical community structure of the network can be revealed. The proposed dynamical system is independent of the quality function used, so it can also be applied in other community mining models. Our algorithm is highly efficient: the computational complexity analysis shows that the execution time is nearly linear in the number of nodes in sparse networks. Finally, we give demonstrative applications of the algorithm to a set of synthetic benchmark networks as well as real-world networks to verify the algorithmic performance.
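The leadership computation lends itself to a short sketch (the Gaussian kernel truncated at two hops below is an illustrative instantiation; the paper's kernel and influence factor may differ): each node accumulates influence from nearby nodes, and a node no neighbor outranks becomes a candidate center.

import java.util.*;

class KernelLeadership {
    // Gaussian kernel over hop distance, truncated at two hops
    static double[] leadership(List<List<Integer>> adj, double sigma) {
        int n = adj.size();
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            Set<Integer> oneHop = new HashSet<>(adj.get(i));
            Set<Integer> twoHop = new HashSet<>();
            for (int j : oneHop)
                for (int k : adj.get(j))
                    if (k != i && !oneHop.contains(k)) twoHop.add(k);
            score[i] = oneHop.size() * Math.exp(-1.0 / (2 * sigma * sigma))
                     + twoHop.size() * Math.exp(-4.0 / (2 * sigma * sigma));
        }
        return score;
    }
    // a node is a candidate center if no neighbor has a higher score
    static List<Integer> candidates(List<List<Integer>> adj, double[] s) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < adj.size(); i++) {
            boolean top = true;
            for (int j : adj.get(i)) if (s[j] > s[i]) { top = false; break; }
            if (top) out.add(i);
        }
        return out;
    }
}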
97.STAMP: Enabling Privacy-Preserving Location Proofs for Mobile Users
ABSTRACT
Location-based services are quickly becoming immensely popular. In addition to services based on users' current locations, many potential services rely on users' location history, or their spatial-temporal provenance. Without a carefully designed security system that lets users prove their past locations, malicious users may lie about their spatial-temporal provenance. In this paper, we present the Spatial-Temporal provenance Assurance with Mutual Proofs (STAMP) scheme. STAMP is designed for ad-hoc mobile users generating location proofs for each other in a distributed setting; however, it can easily accommodate trusted mobile users and wireless access points. STAMP ensures the integrity and non-transferability of the location proofs and protects users' privacy. A semi-trusted Certification Authority is used to distribute cryptographic keys and to guard users against collusion through a light-weight entropy-based trust evaluation approach. Our prototype implementation on the Android platform shows that STAMP is low-cost in terms of computational and storage resources. Extensive simulation experiments show that our entropy-based trust model achieves high collusion detection accuracy.
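The entropy-based trust idea admits a compact illustration (an assumed formulation; the paper's actual trust model may weigh more evidence): a prover whose location proofs come from very few distinct witnesses has a low-entropy witness distribution, which is a signature of collusion.

import java.util.*;

class EntropyTrust {
    // proofsPerWitness: how many of a prover's proofs each witness contributed
    static double trust(Map<String, Integer> proofsPerWitness) {
        int total = 0;
        for (int c : proofsPerWitness.values()) total += c;
        if (total == 0 || proofsPerWitness.size() < 2) return 0;
        double h = 0;
        for (int c : proofsPerWitness.values()) {
            double p = (double) c / total;
            h -= p * (Math.log(p) / Math.log(2));          // Shannon entropy in bits
        }
        double hMax = Math.log(proofsPerWitness.size()) / Math.log(2);
        return h / hMax;                                   // normalized trust in [0, 1]
    }
}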
98.Optimal Resource Allocation Over Time and Degree Classes for Maximizing Information Dissemination in Social Networks
ABSTRACT
We study the optimal control problem of allocating campaigning resources over the campaign duration and degree classes in a social network. Information diffusion is modelled as a Susceptible-Infected epidemic, and direct recruitment of susceptible nodes to the infected (informed) class is used as a strategy to accelerate the spread of information. We formulate an optimal control problem for optimizing a net reward function, a linear combination of the reward due to information spread and the cost due to application of controls. The time-varying resource allocation and the seeds for the epidemic are jointly optimized. A problem variation includes a fixed budget constraint. We prove the existence of a solution for the optimal control problem, provide conditions for uniqueness of the solution, and prove some structural results for the controls (e.g., controls are non-increasing functions of time). The solution technique uses Pontryagin's Maximum Principle and the forward-backward sweep algorithm (and its modifications) for numerical computations. Our formulations lead to large optimality systems with up to about 200 differential equations and allow us to study the effect of network topology (Erdős-Rényi/scale-free) on the controls. Results reveal that the allocation of campaigning resources to various degree classes depends not only on the network topology but also on system parameters such as the cost and abundance of resources. The optimal strategies lead to significant gains over heuristic strategies for various model parameters. Our modelling approach assumes an uncorrelated network; however, we find the approach useful for real networks as well. This work is useful for product advertising and for political and crowdfunding campaigns in social networks.
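The numerical backbone, the forward-backward sweep, is worth sketching for a single degree class (the paper couples many such classes; the objective max x(T) - (c/2)*integral(u^2) and the dynamics dx/dt = (beta*x + u)(1 - x) below are simplified assumptions). The state is integrated forward, the costate from Pontryagin's principle backward, and the control is updated from the optimality condition u = lambda*(1 - x)/c until the sweeps settle.

class ForwardBackwardSweep {
    static double[] solve(double beta, double c, double T, int steps, int sweeps) {
        double dt = T / steps;
        double[] x = new double[steps + 1];    // informed fraction
        double[] lam = new double[steps + 1];  // costate
        double[] u = new double[steps + 1];    // recruitment control
        for (int s = 0; s < sweeps; s++) {
            x[0] = 0.01;                                  // forward sweep: state
            for (int k = 0; k < steps; k++)
                x[k + 1] = x[k] + dt * (beta * x[k] + u[k]) * (1 - x[k]);
            lam[steps] = 1;                               // terminal condition from reward x(T)
            for (int k = steps; k > 0; k--)               // backward sweep: costate
                lam[k - 1] = lam[k] + dt * lam[k] * (beta * (1 - 2 * x[k]) - u[k]);
            for (int k = 0; k <= steps; k++) {            // damped control update
                double fresh = Math.max(0, lam[k] * (1 - x[k]) / c);
                u[k] = 0.5 * (u[k] + fresh);
            }
        }
        return u;
    }
}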
99.Fast and Scalable Range Query Processing With Strong Privacy Protection for Cloud Computing
ABSTRACT
Privacy has been the key roadblock to cloud computing, as clouds may not be fully trusted. This paper is concerned with the problem of privacy-preserving range query processing on clouds. Prior schemes are weak in privacy protection as they cannot achieve index indistinguishability, and therefore allow the cloud to statistically estimate the values of data and queries using domain knowledge and history query results. In this paper, we propose the first range query processing scheme that achieves index indistinguishability under the indistinguishability against chosen keyword attack (IND-CKA) model. Our key idea is to organize indexing elements in a complete binary tree called the PBtree, which satisfies structure indistinguishability (i.e., two sets of data items have the same PBtree structure if and only if the two sets have the same number of data items) and node indistinguishability (i.e., the values of PBtree nodes are completely random and have no statistical meaning). We prove that our scheme is secure under the widely adopted IND-CKA security model. We propose two algorithms, namely PBtree traversal width minimization and PBtree traversal depth minimization, to improve query processing efficiency. We prove that the worst-case complexity of our query processing algorithm using the PBtree is O(|R| log n), where n is the total number of data items and R is the set of data items in the query result. We implemented and evaluated our scheme on a real-world dataset with 5 million items; for example, a query whose result contains 10 data items takes only 0.17 ms.
100.Privacy Preserving Ranked Multi-Keyword Search for Multiple Data Owners in Cloud Computing
ABSTRACT
With the advent of cloud computing, it has become increasingly popular for data owners to outsource their data to public cloud servers while allowing data users to retrieve this data. Out of privacy concerns, secure searches over encrypted cloud data have motivated several research works under the single-owner model. However, most cloud servers in practice do not serve just one owner; instead, they support multiple owners to share the benefits brought by cloud computing. In this paper, we propose schemes to deal with privacy-preserving ranked multi-keyword search in a multi-owner model (PRMSM). To enable cloud servers to perform secure search without knowing the actual data of either keywords or trapdoors, we systematically construct a novel secure search protocol. To rank the search results and preserve the privacy of relevance scores between keywords and files, we propose a novel additive order- and privacy-preserving function family. To prevent attackers from eavesdropping on secret keys and pretending to be legal data users submitting searches, we propose a novel dynamic secret key generation protocol and a new data user authentication protocol. Furthermore, PRMSM supports efficient data user revocation. Extensive experiments on real-world datasets confirm the efficacy and efficiency of PRMSM.
101.Attribute-Based Data Sharing Scheme Revisited in Cloud Computing
ABSTRACT
Ciphertext-policy attribute-based encryption (CP-ABE) is a very promising encryption technique for secure data sharing in the context of cloud computing. The data owner is allowed to fully control the access policy associated with the data to be shared. However, CP-ABE suffers from a potential security risk known as the key escrow problem, whereby the secret keys of users have to be issued by a trusted key authority. Besides, most of the existing CP-ABE schemes cannot support attributes with arbitrary state. In this paper, we revisit attribute-based data sharing schemes in order not only to solve the key escrow issue but also to improve the expressiveness of attributes, so that the resulting scheme is friendlier to cloud computing applications. We propose an improved two-party key issuing protocol that guarantees that neither the key authority nor the cloud service provider can individually compromise the whole secret key of a user. Moreover, we introduce the concept of attributes with weights to enhance attribute expressiveness, which not only extends the expression from binary to arbitrary state but also lightens the complexity of the access policy. Therefore, both the storage cost and the encryption complexity for a ciphertext are reduced. The performance analysis and the security proof show that the proposed scheme is able to achieve efficient and secure data sharing in cloud computing.
102.Optimal Secrecy Capacity-Delay Tradeoff in Large-Scale Mobile Ad Hoc Networks
ABSTRACT
In this paper, we investigate the impact of an information-theoretic secrecy constraint on the capacity and delay of mobile ad hoc networks (MANETs) with mobile legitimate nodes and static eavesdroppers whose location and channel state information (CSI) are both unknown. We assume n legitimate nodes move according to the fast i.i.d. mobility pattern, and each desires to communicate with one randomly selected destination node. There are also n^ν static eavesdroppers located uniformly in the network, and we assume the number of eavesdroppers is much larger than that of the legitimate nodes, i.e., ν > 1. We propose a novel, simple secure communication model, the secure protocol model, and prove its equivalence to the widely accepted secure physical model under a few technical assumptions. Based on the proposed model, a framework for analyzing the secrecy capacity and delay in MANETs is established. Given a delay constraint D, we find that the optimal secrecy throughput capacity is Θ̃(W(D/n)^(2/3)), where W is the data rate of each link. We observe that: 1) the capacity-delay tradeoff is independent of the number of eavesdroppers, which indicates that adding more eavesdroppers will not degrade the performance of the legitimate network as long as ν > 1; and 2) the capacity-delay tradeoff of our paper outperforms the previous result Θ(1/(nψe)) in the literature, where ψe = n^(ν−1) = ω(1) is the density of the eavesdroppers. Throughout this paper, for functions f(n) and g(n), we denote f(n) = o(g(n)) if lim_{n→∞} f(n)/g(n) = 0; f(n) = ω(g(n)) if g(n) = o(f(n)); f(n) = O(g(n)) if there is a positive constant c such that f(n) ≤ c·g(n) for sufficiently large n; f(n) = Ω(g(n)) if g(n) = O(f(n)); and f(n) = Θ(g(n)) if both f(n) = O(g(n)) and f(n) = Ω(g(n)) hold. The order notation Θ̃ omits polylogarithmic factors for better readability.
103.BCCC: An Expandable Network for Data Centers
ABSTRACT
Designing a cost-effective network topology for data centers that can deliver sufficient bandwidth and consistent latency performance to a large number of servers has been an important and challenging problem. Many server-centric data center network topologies have been proposed recently due to their significant advantages in cost efficiency and data center agility, such as BCube, FiConn, and the Bidimensional Compound Network (BCN). However, existing server-centric topologies are either not expandable or demand prohibitive expansion costs. As the scale of data centers increases rapidly, the lack of expandability in existing server-centric data center networks poses a severe obstacle to data center upgrades. In this paper, we present a novel server-centric data center network topology called BCube Connected Crossbars (BCCC), which can provide good network performance using inexpensive commodity off-the-shelf switches and commodity servers with only two network interface card (NIC) ports. A significant advantage of BCCC is its good expandability: when there is a need for expansion, we can easily add new servers and switches into the existing BCCC with little alteration of the existing structure. Meanwhile, BCCC can accommodate a large number of servers while keeping a very small network diameter. A desirable property of BCCC is that its diameter increases only linearly with the network order (i.e., the number of dimensions), which is superior to most existing server-centric networks, such as FiConn and BCN, whose diameters increase exponentially with the network order. In addition, there is a rich set of parallel paths of similar length between any pair of servers in BCCC, which enables BCCC not only to deliver sufficient bandwidth capacity and predictable latency to end hosts, but also to provide graceful performance degradation in case of component failure. We conduct comprehensive comparisons between BCCC and other popular server-centric network topologies, such as FiConn and BCN. We also propose an effective addressing scheme and routing algorithms for BCCC. We show that BCCC has significant advantages over the existing server-centric topologies in many important metrics, such as expandability, server port utilization, and network diameter.
104.Identification of Boolean Networks Using Premined Network Topology Information
ABSTRACT
This brief aims to reduce the data requirement for the identification of Boolean networks (BNs) by using premined network topology information. First, a matching table is created and used for sifting the true from the false dependences among the nodes in the BNs. Then, a dynamic extension to the matching table is developed so that the locating of matching pairs can start as soon as possible. Next, based on the pseudo-commutative property of the semi-tensor product, a position-transform mining is carried out to further improve data utilization. Combining the above, the topology of the BNs can be premined for the subsequent identification. Examples are given to illustrate the efficiency of reducing the data requirement. Some excellent features, such as online and parallel processing ability, are also demonstrated.
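The sifting step can be pictured with a small consistency check in Java (illustrative; the brief's matching table and semi-tensor-product machinery go further): a candidate dependence "node i is driven by regulator set R" survives only if no two observed states that agree on R lead to different next values of i.

import java.util.*;

class DependenceSift {
    // states[t]: Boolean network state at time t; returns true if R is consistent for node i
    static boolean consistent(boolean[][] states, int i, int[] R) {
        Map<String, Boolean> table = new HashMap<>();      // the "matching table"
        for (int t = 0; t + 1 < states.length; t++) {
            StringBuilder key = new StringBuilder();
            for (int r : R) key.append(states[t][r] ? '1' : '0');
            Boolean prev = table.put(key.toString(), states[t + 1][i]);
            if (prev != null && prev != states[t + 1][i]) return false; // contradiction found
        }
        return true;
    }
}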
105.An Integrated Systematic Approach to Designing Enterprise Access Control
ABSTRACT
Today, the network design process remains ad hoc and largely complexity-agnostic, often resulting in suboptimal networks characterized by excessive amounts of dependence and commands in device configurations. This unnecessarily high configuration complexity can lead to a huge increase in both the amount of manual intervention required for managing the network and the likelihood of configuration errors, and thus must be avoided. In this paper, we present an integrated top-down design approach and show how it can minimize unnecessary configuration complexity in realizing reachability-based access control, a key network design objective that involves designing three distinct network elements: virtual local-area networks (VLANs), IP addresses, and packet filters. Capitalizing on newly developed abstractions, our approach integrates the design of these three elements into a unified framework by systematically modeling how the design of one element may impact the complexity of other elements. Our approach goes substantially beyond the current divide-and-conquer approach that designs each element in complete isolation, and enables minimizing the combined complexity of all elements. Specifically, two new optimization problems are formulated, and novel algorithms and heuristics are developed to solve them. Evaluation on a large campus network shows that our approach can effectively reduce packet filter complexity and VLAN trunking complexity by more than 85% and 70%, respectively, when compared with the ad hoc approach currently used by operators.