Feature selection for knowledge discovery and data mining pdf
Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data.
We also show that CovSkew has low computational costs compared with most of the baselines.
Data mining projects increasingly require records about individuals to be linked across databases to facilitate advanced analytics. The process of linking records without revealing any sensitive or confidential information about the entities represented by these records is known as privacy-preserving record linkage (PPRL). Bloom filters are a popular PPRL technique for encoding sensitive information while still enabling approximate linking of records.
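To make the Bloom filter idea concrete, here is a minimal sketch of one common way such an encoding can work: a string is split into character bigrams, each bigram is hashed to a few bit positions, and two encodings are compared with the Dice coefficient. The function names, the choice of SHA-256, and the parameters `num_bits`, `num_hashes` and `q` are illustrative assumptions, not details taken from the paper.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of overlapping, lower-cased q-grams."""
    s = value.lower()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def bloom_encode(value, num_bits=64, num_hashes=2, q=2):
    """Return the set of bit positions switched on for `value`.

    Assumption: each q-gram is hashed `num_hashes` times with SHA-256,
    salted by the hash index. Real PPRL schemes may hash differently.
    """
    positions = set()
    for gram in qgrams(value, q):
        for k in range(num_hashes):
            digest = hashlib.sha256(f"{k}:{gram}".encode("utf-8")).hexdigest()
            positions.add(int(digest, 16) % num_bits)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient between two sets of bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    a = bloom_encode("smith")
    b = bloom_encode("smyth")
    print(sorted(a), sorted(b), dice_similarity(a, b))
```

Because similar strings share q-grams, they share bit positions, which is what allows approximate matching on the encoded values. The same property means that bit patterns carry information about the underlying q-grams.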
However, Bloom filter encoding can be vulnerable to attacks that re-identify some encoded values from sets of Bloom filters. Existing attacks exploit the fact that certain Bloom filters can occur frequently in an encoded database, and thus likely correspond to frequent plain-text values such as common names. We present a novel attack method based on a maximal frequent itemset mining technique which identifies frequently co-occurring bit positions in a set of Bloom filters.
Our attack can re-identify encoded sensitive values even when all Bloom filters in an encoded database are unique. As our experiments on a real-world data set show, our attack can successfully re-identify values from encoded Bloom filters even in scenarios where previous attacks fail.
Bayesian optimization is a powerful machine learning technique for solving experimental design problems.
Used in industrial design optimization, it can significantly reduce the time and cost of industrial processes. However, experimenters in industry often lack expertise in optimization techniques and may require help from third-party optimization services. This can cause privacy concerns, as the optimized design of an industrial process typically needs to be kept secret to retain its competitive advantage. To this end, we propose a novel Bayesian optimization algorithm that allows experimenters from an industry to utilize the expertise of a third-party optimization service in a privacy-preserving manner.
The privacy of our proposed algorithm is guaranteed under a modern privacy-preserving framework called Error Preserving Privacy, which is specifically designed to maintain high utility even under privacy restrictions. Using several benchmark optimization problems as well as optimization problems from real-world industrial processes, we demonstrate that the optimization efficiency of our algorithm is comparable to that of the non-private Bayesian optimization algorithm and significantly better than that of its differential privacy counterpart.
Robust machine learning algorithms have been widely studied in adversarial environments where the adversary maliciously manipulates data samples to evade security systems. In this paper, we propose randomized SVMs against generalized adversarial attacks under uncertainty, which learn a distribution over classifiers rather than the single classifier of traditional robust SVMs.
The randomized SVMs offer better resistance against attacks while preserving high classification accuracy, especially in non-separable cases. The experimental results demonstrate the effectiveness of our proposed models in defending against various attacks, including aggressive attacks with uncertainty.
The growth of the male fashion industry and the escalating popularity of affordable street fashion wear have created a demand for effective data analytics and recommender systems for male street wear. This motivated us to undertake an extensive collection of images of male subjects in casual wear and pose, to annotate them assiduously, and to carefully select discriminating features.
We build a classifier which accurately predicts the attractiveness quotient of an outfit. Further, we build a recommendation system, MalOutRec, which provides pointed recommendations for changing a part of the outfit in case the outfit looks unattractive e. We employ an innovative methodology that uses personalized PageRank in designing MalOutRec; experimental results show that it handsomely beats the metapath-based baseline algorithm.
Due to the increasing popularity of location-based services, a massive volume of human mobility records has been generated.
At the same time, the growing spatial context data provides us rich semantic information. Associating the mobility records with relevant surrounding contexts, known as the location annotation, enables us to understand the semantics of the mobility records and helps further tasks like advertising. However, the location annotation problem is challenging due to the ambiguity of contexts and the sparsity of personal data.
This method leverages user grouping and venue categories to alleviate the data sparsity issue and annotates locations according to multi-view information (spatial, temporal and contextual) of multiple granularities.
Through extensive experiments on a real-world dataset, we demonstrate that our method significantly outperforms other baseline methods.
In recommendation systems, items of interest are often classified into categories such as genres of movies. Existing research has shown that diversified recommendations can improve real user experience. We propose an algorithm that considers user preferences for different categories when recommending diversified results, and refer to this problem as personalized recommendation diversification.
In the proposed algorithm, a model that captures user preferences for different categories is optimized jointly toward both relevance and diversity. To provide the proposed algorithm with informative training labels and effectively evaluate recommendation diversity, we also propose a new personalized diversity measure. The proposed measure overcomes limitations of existing measures in evaluating recommendation diversity: existing measures either cannot effectively handle user preferences for different categories, or cannot evaluate both relevance and diversity at the same time.
Experiments using two real-world datasets confirm the superiority of the proposed algorithm, and show the effectiveness of the proposed measure in capturing user preferences.
In recent years, word embedding models have received tremendous research attention due to their capability of capturing textual semantics.
This study investigates the issue of deploying word embedding models on resource-limited smartphones for personalized item recommendation. The challenge is that existing embedding models are often too large to fit into a resource-limited smartphone. One naive idea is to keep the model in secondary storage and process recommendations against that secondary storage.
However, this idea suffers from the burden of additional traffic. To this end, we propose a framework called Word Embedding Quantization (WEQ) that constructs an index upon a given word embedding model and stores the index in primary storage to enable the use of the word embedding model on smartphones.
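The abstract does not say how the WEQ index is built. As a rough illustration of why a compact, quantized representation is smaller but inexact, the sketch below applies plain uniform scalar quantization to an embedding vector. This is a generic technique swapped in for illustration, not the authors' WEQ construction; `quantize`, `dequantize`, `levels`, `lo` and `hi` are hypothetical names and parameters.

```python
def quantize(vector, levels=16, lo=-1.0, hi=1.0):
    """Map each component to an integer code in [0, levels - 1].

    Assumption: components lie roughly in [lo, hi]; values outside the
    range are clamped to the nearest end code.
    """
    step = (hi - lo) / (levels - 1)
    return [min(levels - 1, max(0, round((x - lo) / step))) for x in vector]


def dequantize(codes, levels=16, lo=-1.0, hi=1.0):
    """Recover an approximate vector from its integer codes."""
    step = (hi - lo) / (levels - 1)
    return [lo + c * step for c in codes]


if __name__ == "__main__":
    embedding = [0.12, -0.5, 0.97, -1.0, 1.0]
    codes = quantize(embedding)       # 4 bits per component when levels=16
    approx = dequantize(codes)        # close to, but not equal to, the original
    print(codes)
    print(approx)
```

With 16 levels each component needs 4 bits instead of a 32-bit float, at the cost of a reconstruction error of at most half a quantization step per in-range component. Anything computed from the compact codes alone is therefore only approximately known.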
One challenge in using the index is that the exact user profile is no longer available. However, we find that there are opportunities for computing the correct recommendation results while knowing only an inexact user profile. In this paper, we propose a series of techniques that leverage these opportunities to compute candidates, with the goal of minimizing the cost of accessing secondary storage.
Experiments verify the efficiency of the proposed techniques and demonstrate the feasibility of the proposed framework.
In this paper, we study the topic-specific retweet count ranking problem in Weibo. Two challenges make this task nontrivial. Firstly, traditional methods cannot derive effective features for tweets, because in the topic-specific setting tweets usually share too much content to be distinguished from one another.
We propose an LSTM-embedded autoencoder to generate tweet features, with the insight that any of the different prefixes of a tweet's text is a possible distinctive feature. Secondly, it is critical to fully capture the meaning of the topic in the topic-specific setting, but Weibo provides little information about the topic. We evaluate the proposed components through ablation, and compare the overall solution with a recently proposed tensor factorization model.
Extensive experiments on real Weibo data show the effectiveness and flexibility of our methods.
Characterizing and understanding information diffusion over social networks play an important role in various real-world applications.
In many scenarios, however, only the states of nodes can be observed while the underlying diffusion networks are unknown. Many methods have therefore been proposed to infer the underlying networks based on node observations. To enhance the inference performance, structural priors of the networks, such as sparsity, scale-free, and community structures, are often incorporated into the learning procedure. As the building blocks of networks, network motifs occur frequently in many social networks, and play an essential role in describing the network structures and functionalities.
However, to the best of our knowledge, no existing work exploits this kind of structural primitive in diffusion network inference. To address this unexplored yet important issue, we propose a novel framework called Motif-Aware Diffusion Network Inference (MADNI), which aims to mine the motif profile from the node observations and infer the underlying network based on the mined motif profile. The mined motif profile and the inferred network are alternately refined until the learning procedure converges. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of the proposed framework.
Given a graph stream, how can we estimate the number of triangles in it using multiple machines with limited storage? Counting triangles i. Recently, for triangle counting in massive graphs, two approaches have been intensively studied. One approach is streaming algorithms, which estimate the count of triangles incrementally in time-evolving graphs or in large graphs only part of which can be stored.
The other approach is distributed algorithms for utilizing computational power and storage of multiple machines.
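As a minimal single-machine illustration of the streaming approach, the sketch below keeps each arriving edge with probability `p`, counts the triangles that each kept edge closes among the edges already kept, and rescales the count by `1 / p**3`. This is a generic edge-sampling estimator under stated assumptions (an undirected stream without duplicate edges), not the distributed method the abstract goes on to describe; the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def estimate_triangles(edge_stream, p=1.0, seed=0):
    """Estimate the triangle count of an undirected edge stream.

    Each edge is kept independently with probability p. A triangle is
    counted only if all three of its edges are kept (probability p**3),
    so the sampled count is divided by p**3 to correct for sampling.
    Assumption: the stream contains no duplicate edges.
    """
    rng = random.Random(seed)
    neighbors = defaultdict(set)   # adjacency sets of the kept edges only
    sampled_triangles = 0
    for u, v in edge_stream:
        if u == v:
            continue               # ignore self-loops
        if rng.random() >= p:
            continue               # edge not sampled, nothing is stored
        # Every common neighbor of u and v closes one triangle with (u, v).
        sampled_triangles += len(neighbors[u] & neighbors[v])
        neighbors[u].add(v)
        neighbors[v].add(u)
    return sampled_triangles / p ** 3


if __name__ == "__main__":
    # Complete graph on 4 nodes: exactly 4 triangles.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(estimate_triangles(k4, p=1.0))
```

With `p=1.0` every edge is stored and the count is exact. Lowering `p` reduces memory to roughly a fraction `p` of the edges at the price of higher variance in the estimate.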