4th Swedish Workshop on Data Science
(SweDS 2016)

Skövde, Sweden November 10-11, 2016
http://www.mdai.cat/sweds16


Dates and venue
The workshop will take place on November 10-11, 2016 at the University of Skövde, in Room D107 (Building D) on November 10th and in Room Insikten (Building Portalen) on November 11th (Buildings D and 8 on the campus map).
Program and registration procedure can be found below.
The SweDS workshop is co-located with the 11th Annual Workshop in Systems Biology, and SweDS participants are welcome to attend it as well.

Program: Tentative program

November 10th (Thursday), 2016

(Venue: Room D107, Building D, U. of Skövde)

09:30 - 10:00 Opening
  • Jörgen Hansson, Head of the School of Informatics
10:00 - 10:50 Invited talk:

  • Ola Gustafsson (Dagens Nyheter), Chair: Alan Said
    News recommendation with diversity and reduced gender bias at Dagens Nyheter

    Online news recommendation is a field of many practical challenges, but also one of conflicting goals. As we optimize personalized recommendations for pageviews and click-through rates, we may also reinforce bias and the creation of filter bubbles. At Dagens Nyheter, we design algorithms for personalization in a way that aims to align with editorial ambitions of diversity and reduced gender bias. Maybe it is time to talk about consumer awareness as an aspect of algorithmic design?
11:00 - 12:00 Session Chair: Maria Riveiro
  • A smoothed monotonic regression via L2 regularization
    Oleg Sysoev, Oleg Burdakov (LIU)
    Monotonic Regression (MR) is a standard method for extracting a monotone function from non-monotonic data, and it is used in many applications. However, a known drawback of this method is that its fitted response is a piecewise constant function, while practical response functions are often required to be continuous. We propose a method that achieves monotonicity and smoothness of the regression by introducing an L2 regularization term, and we show that the worst-case complexity of this method is O(n²). In addition, our simulations demonstrate that the proposed method is very fast, i.e., it can fit more than a million observations in less than one minute, and that it has higher predictive power than some commonly used alternative methods, such as monotonic kernel smoothers. In contrast to these methods, our approach is probabilistically motivated and has connections to Bayesian modeling.
    [1] O. Sysoev and O. Burdakov. A smoothed monotonic regression via L2 regularization. Technical Report LiTH-MAT-R–2016/01–SE, Department of Mathematics, Linköping University, 2016.
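    As a hedged sketch (the precise objective is in the report above), L2-regularized monotonic regression of responses y_1, ..., y_n can be written as the constrained problem

    ```latex
    \min_{f_1 \le f_2 \le \dots \le f_n}\;
      \sum_{i=1}^{n} (y_i - f_i)^2
      \;+\; \lambda \sum_{i=2}^{n} (f_i - f_{i-1})^2
    ```

    where the first term is the usual MR fit under the monotonicity constraint, and the λ-weighted penalty on successive differences smooths the otherwise piecewise-constant solution.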
  • Approximate Search in Large Intrusion Detection and SPAM Filtering Data Sets
    Ambika Shreshta Chitrakar, Slobodan Petrovic (NTNU, Gjøvik, Norway)
    Due to the enormous amount of data traffic and high data rates, the quantities of data to be analyzed by Intrusion Detection Systems (IDS) and SPAM filters per unit of time have become the limiting factor of their further development. The efficiency of current search algorithms used in these systems is not high enough for the real-time data processing that is necessary for attacks against computer systems to be detected as they happen. There are two possibilities for further development of these systems: improving the efficiency of the search algorithms and reducing the data sets to be analyzed at a time. Regarding the efficiency of the search algorithms, there are limitations to using the theoretically best possible algorithms (so-called skip algorithms, such as Backward Non-deterministic DAWG Matching), since these algorithms are sensitive to algorithmic attacks against the IDS itself. Namely, the average-case time complexity of these search algorithms is much better than the worst-case complexity, and consequently an attacker can deliberately send attack traffic that makes these algorithms perform poorly. On the other hand, regarding the size of the data sets to be processed at a time, there is potential for reduction, since many attack signatures have a common structure due to the fact that new attacks often originate from old ones. This motivates the application of approximate search in intrusion detection, which is capable of detecting many similar attacks with only one execution of the search algorithm. Thus, by using approximate search in intrusion detection, we obtain a more efficient search algorithm operating over a reduced dataset, which has the potential to significantly improve the efficiency of IDS. In SPAM filtering, spammers try to avoid detection by deliberately changing the SPAM words; approximate search is also capable of detecting such cases. To avoid so-called false positives and false negatives in both intrusion detection and SPAM filtering, we introduce constraints in the approximate search algorithms that limit the total numbers of edit operations and/or the lengths of runs of edit operations. The constraints exploit the fact that attackers/spammers cannot apply just any number of edit operations to the traffic they generate, and that the distribution of these changes cannot be arbitrary: otherwise, the attacks might behave in an unpredictable way and the SPAM messages would lose their intelligibility. This talk explains how these constraints are used and what their effect is on the numbers of false positives/negatives and on the efficiency of the search algorithms.
    [1] A. S. Chitrakar, S. Petrović, Approximate search with constraints on indels with application in SPAM filtering, Proc. Norwegian Information Security Conference (NISK-2015), pp. 22-33.
    [2] A. S. Chitrakar, S. Petrović, Constrained row-based bit-parallel search in intrusion detection, submitted to NISK 2016.
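    The constrained algorithms themselves are described in the references above; as a rough illustration of the unconstrained core idea, the sketch below finds all positions where a pattern matches a text with at most a given number of edit operations, using Sellers' classic dynamic programming (the machinery constraining indels and runs of edits is omitted):

    ```python
    def approx_find(pattern, text, max_edits):
        """Return end positions in `text` where `pattern` occurs with at
        most `max_edits` edit operations (Sellers' dynamic programming)."""
        m = len(pattern)
        # prev[i] = edit distance between pattern[:i] and the best-matching
        # suffix of the text read so far; column 0 stays 0 because a match
        # may begin at any text position.
        prev = list(range(m + 1))
        hits = []
        for j, c in enumerate(text):
            curr = [0]
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == c else 1
                curr.append(min(prev[i - 1] + cost,   # match / substitution
                                prev[i] + 1,          # insertion
                                curr[i - 1] + 1))     # deletion
            if curr[m] <= max_edits:
                hits.append(j)
            prev = curr
        return hits

    print(approx_find("viagra", "buy v1agra now", 1))  # -> [9]
    ```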
  • Root-Cause Localization using Restricted Boltzmann Machines
    H. Joe Steinhauer, Alexander Karlsson, Gunnar Mathiason, Tove Helldin (HIS)
    Monitoring complex systems and identifying degrading system components before the system or parts thereof fail is crucial for many application areas, among them today’s and future telecommunication systems. With the increasing complexity of such systems, the need to aid human operators through the use of machine learning tools is growing. In this paper, we present an automated approach for root-cause localization, a first step towards root-cause analysis, using a Restricted Boltzmann Machine (RBM). We describe an experiment conducted on data with ground truth, stemming from a simplified network. We use the RBM to cluster symptoms of degradation and we show how the results produced by the RBM capture the location of different possible combinations of hidden root causes.
    [1] H. J. Steinhauer, A. Karlsson, G. Mathiason, T. Helldin, Root-Cause Localization using Restricted Boltzmann Machines, Proc. 19th Int. Conf. on Information Fusion, 2016.
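    As a minimal illustration of the model class (a sketch, not the authors' implementation), the following trains a Bernoulli RBM with one-step contrastive divergence on random stand-in "symptom" vectors and reads the hidden-unit activations as a soft clustering:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n_visible, n_hidden, lr = 12, 4, 0.1
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    # Random binary stand-in data; real input would be symptom vectors.
    X = (rng.random((200, n_visible)) < 0.3).astype(float)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(50):
        for v0 in X:
            ph0 = sigmoid(v0 @ W + b_h)                  # P(h=1 | v0)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b_v)                # reconstruction
            ph1 = sigmoid(pv1 @ W + b_h)
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            b_v += lr * (v0 - pv1)
            b_h += lr * (ph0 - ph1)

    # Hidden activation probabilities give a soft grouping of the inputs.
    print(sigmoid(X @ W + b_h)[:3].round(2))
    ```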

12:00 - 13:30 Lunch time

13:30 - 14:20 Invited talk:

  • Svetoslav Marinov (Seal Software), Chair: Joe Steinhauer
    Machine learning for contract analytics

    At Seal Software we apply machine learning techniques extensively to analyse legal contracts, using both supervised and unsupervised learning. With the latest release of our product, we give users the possibility to create their own models based on their own manually annotated data. This feature came with a lot of interesting problems, such as choosing optimal evaluation techniques and handling imbalanced learning (e.g., where the underlying training data is quite skewed). In this talk, I will walk you through the way we are tackling these and some other problems in a user-driven machine learning environment.
14:30 - 15:30 Session Chair: Tove Helldin
  • Efficient Parameter Tuning for Image Binarization
    Florian Westphal, Håkan Grahn, Niklas Lavesson (BTH)
    Image binarization is the first important processing step when making historical document images searchable, transcribing them, or analyzing the layout of these documents. A good binarization quality is paramount for those tasks, since only the detected foreground pixels will be processed further. With ever-growing collections of digitized documents, efficient processing becomes equally vital. In our work, we propose a fast way of tuning the parameters of a state-of-the-art binarization algorithm. These parameters adjust the binarization algorithm to a given image to improve binarization quality. By predicting the algorithm's parameters based on image features, such as contrast, homogeneity, edge mean intensity and background standard deviation, we are able to tune the algorithm's parameters on average 3 times faster than previous approaches. This is an average time difference of 17 seconds on a standard binarization dataset, which is achieved without decreasing the binarization quality.
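    A minimal sketch of the prediction idea, with illustrative features and a hypothetical tuned parameter (say, the k of a Sauvola-style binarization) standing in for the authors' exact setup:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def features(img):
        # Rough stand-ins for contrast, background deviation, edge intensity.
        return [img.mean(), img.std(), np.abs(np.diff(img, axis=1)).mean()]

    rng = np.random.default_rng(0)
    train_imgs = [rng.random((64, 64)) for _ in range(20)]
    best_k = rng.uniform(0.1, 0.5, size=20)  # would come from offline tuning

    # Learn a mapping from cheap image features to a good parameter value,
    # then predict instead of searching the parameter space per image.
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit([features(im) for im in train_imgs], best_k)
    print("predicted k:", model.predict([features(rng.random((64, 64)))])[0])
    ```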
  • FlinkML: Large Scale Machine Learning with Apache Flink
    Theodore Vasiloudis (SICS & KTH)
    Apache Flink is an open source platform for distributed stream and batch data processing. In this talk we will show how Flink's streaming engine and support for native iterations make it an excellent candidate for the development of large scale machine learning (ML) algorithms.
    This talk will focus on FlinkML, an effort to develop scalable machine learning tools utilizing the efficient distributed runtime of Apache Flink. We will provide an introduction to the library, illustrate how we employ state-of-the-art algorithms for classification, regression and recommendation to make FlinkML truly scalable, and provide a view into the challenges and decisions one has to make when designing a robust and scalable machine learning library. Particular focus will be given to the challenges encountered in developing a community-driven open-source ML library.
    Finally, if time permits, we will demonstrate how one can perform interactive analysis using FlinkML and the notebook environment of Apache Zeppelin, combining the power of a distributed processing engine with more traditional data science tools like the Python scipy stack.
  • Reconstruction of Causal Networks to Resolve Conflicting Causal Inferences
    Sepideh Pashami(1), Anders Holst(2), Sławomir Nowaczyk(1) ((1) Halmstad University, (2) SICS)
    One of the challenges in vehicle maintenance is finding the root cause of a fault, in order to avoid recurrence of the failure and undesirable follow-up failures. In addition, large savings can be obtained by focusing on components where failures are likely to cause costly collateral damage; for example, over-speeding and destruction of a turbocharger often lead to additional engine damage. A causal network is useful for providing the overall picture of how various parameters affect the vehicle’s performance.
    With the increasing amount of diverse data being collected, there are new opportunities to distinguish between correlation and causation, either automatically or semi-automatically. The focus of this work is to identify the causal relations between signals measured on-board heavy-duty vehicles. The proposed method reconstructs the causal network using the PC algorithm [1] in order to get closer to the underlying causal structure between signals. The PC algorithm builds a Markov equivalence class which contains the underlying causal graph, and represents it by a Completed Partially Directed Acyclic Graph (CPDAG). The calculations are based on conditional independence tests. In many cases, the CPDAG produced contains bi-directed edges. Such bi-directed edges are undesirable, as they imply the existence of a confounding variable, i.e., an unknown factor which influences both of the signals (nodes). Similarly, fully connected nodes (cliques) within the causal network are undesirable due to the ambiguity of identifying cause and effect. We relax the sufficiency assumption by adding one or more latent variables. We connect the latent variables to the nodes in the cliques, and assign values to these latent variables in such a way that the nodes in each clique become independent given the corresponding latent variable. After that, the PC algorithm is rerun with the new set of variables until no conflicting variables remain.
    The effectiveness of the proposed approach is demonstrated on a data set collected by a fleet of five city buses. In particular, the aim is to identify the causal relation between the set of signals influencing fuel consumption. This analysis is performed only based on observational data, without the need for specifying the underlying physical model.
    [1] P. Spirtes, C. Glymour, and R. Scheines, Causation, Prediction, and Search, 2nd ed. Cambridge, MA: MIT Press, 2000.
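    The PC algorithm's basic primitive is a conditional independence test; for continuous signals a common choice is a partial-correlation test with Fisher's z transform. A minimal sketch of that primitive (not the full PC algorithm), on synthetic data with a known chain x1 -> x2 -> x3:

    ```python
    import numpy as np
    from scipy import stats

    def ci_test(data, i, j, cond, alpha=0.05):
        """True if column i is independent of column j given columns `cond`."""
        sub = data[:, [i, j] + list(cond)]
        prec = np.linalg.inv(np.corrcoef(sub, rowvar=False))
        r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])   # partial corr.
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
        return 2 * (1 - stats.norm.cdf(abs(z))) > alpha

    rng = np.random.default_rng(0)
    x1 = rng.standard_normal(1000)
    x2 = x1 + 0.1 * rng.standard_normal(1000)   # x2 caused by x1
    x3 = x2 + 0.1 * rng.standard_normal(1000)   # x3 caused by x2
    data = np.column_stack([x1, x2, x3])
    print(ci_test(data, 0, 2, []))    # expected False: x1, x3 dependent
    print(ci_test(data, 0, 2, [1]))   # expected True: independent given x2
    ```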
15:30 - 16:00 Coffee
16:00 - 17:00 Session Chair: Göran Falkman
  • Energy Efficiency in Machine Learning
    Eva García Martín, Niklas Lavesson, Håkan Grahn (BTH)
    Energy efficiency has become a key concern in software development in recent years. Researchers in machine learning are starting to understand the importance of designing solutions from a sustainable perspective. For instance, deep learning techniques and algorithms, known for their high-performance solutions, are being optimised towards energy efficiency. Google is also applying machine learning techniques to reduce the energy consumption of their servers. However, we believe that there needs to be a systematic approach to add the energy efficiency variable to algorithm analysis, since algorithms are still developed considering only the traditional variables.
    The goal of this study is to present a reproducible approach to analyze and optimize machine learning algorithms from an energy efficiency perspective. The energy consumption is examined together with the accuracy, to portray the different trade-offs that exist when trying to reduce the energy consumption of a computation. We created an experiment where we measure the energy consumption and accuracy of different data stream mining algorithms and algorithmic setups. The results show that energy consumption can be reduced by 74.29% in the Very Fast Decision Tree algorithm while sacrificing just 1% of accuracy.
    The main contribution of this work is the validation that energy is an important factor to take into consideration when designing algorithms. Since this factor is often overlooked, we believe that different algorithms could be chosen for a specific task from a green computing perspective depending on the accuracy constraints. This work also enables us to optimise algorithms for computer platforms with scarce resources, such as embedded systems.
    [1] Work in progress. Submitted also to WiML (Women in Machine Learning) 2016
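    The study's measurement setup is not described here; as one hedged possibility on Linux, the Intel RAPL powercap counters can be read around a workload to estimate its energy cost (the sysfs path below is a common default and may require elevated privileges):

    ```python
    import time

    RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package-0 counter

    def read_uj():
        with open(RAPL) as f:
            return int(f.read())

    before, t0 = read_uj(), time.time()
    sum(i * i for i in range(10_000_000))   # the workload being measured
    after, t1 = read_uj(), time.time()

    joules = (after - before) / 1e6   # counter is in microjoules
    print(f"{joules:.2f} J in {t1 - t0:.2f} s "
          f"({joules / (t1 - t0):.1f} W average)")
    # A robust harness would also handle counter wrap-around.
    ```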
  • Random Forest Response Surfaces for Robust Design
    Siva Krishna Dasari (1), Niklas Lavesson (1), Johan Wall (1), Petter Andersson (2) ((1) BTH, (2) GKN Aerospace Engine Systems Sweden)
    Intelligent data analysis is increasingly used within the area of product development for different types of decision support. One example of this is the construction of response surface models (RSMs) in a model-based approach to robust design. The construction of RSMs requires a dataset of inputs and known outputs from simulated experiments. Since simulations are expensive to conduct, datasets are usually small in a real-world context. The size of the datasets and the complexity of the underlying simulation model make it difficult to generate accurate and robust RSMs. For robust design, aiming to reduce the variation in system performance, sensitivity analysis (SA) based on RSMs allows for efficient studies of how uncertainties in input parameters affect system performance. In this study, we investigate the applicability of Random Forests (RF) for RSMs and consecutive SA. The reasons for selecting RF are that: (1) it can handle non-linear data; (2) it can handle high-dimensional data; (3) it ranks parameters by importance; (4) ensemble methods generally build more accurate models than single models; and (5) if design engineers need information about variable interactions, human-understandable decision rules can be extracted from tree models. To determine whether RF can perform as well as other methods, we compare RF to Multivariate Adaptive Regression Splines (MARS) and Support Vector Machines (SVM). We conducted two experiments using anonymized real-world and synthetic data for RSMs and SA, respectively. The output from the RSMs suggests that the three studied algorithms perform equally well with respect to predictive accuracy. The output from the SA suggests that RF and MARS perform equally well, and better than SVM on non-linear responses. Furthermore, it was shown that RF is more computationally efficient than MARS and SVM. These experimental results, combined with other potential benefits of RF related to robust design, such as the ability to screen parameters, indicate that RF is suitable for the intended application.
    [1] Siva Krishna Dasari, Niklas Lavesson, Johan Wall and Petter Andersson. Random Forest Response Surfaces for Robust Design. Submitted to IEEE Int. Conf. on Machine Learning and Applications (IEEE ICMLA'16), 2016.
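    A minimal sketch of a random-forest response surface on a small synthetic design, with impurity-based feature importances as a crude sensitivity measure; the toy response function is an assumption for illustration only:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(60, 4))   # small simulated experiment set
    y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.standard_normal(60)

    rsm = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
    print("surrogate prediction:", rsm.predict([[0.2, -0.4, 0.0, 0.9]]))
    # Parameter ranking: x0 and x1 should dominate; x2 and x3 are inactive.
    print("importances:", rsm.feature_importances_.round(3))
    ```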
  • Data privacy and data provenance for big data
    Vicenç Torra (HIS)
    In this work we will discuss the problem of data privacy for big data. We will focus on the relationship between data provenance and data privacy, and show that data provenance can be used to implement the right to amend and the right to be forgotten. We will introduce a definition of privacy for the case in which modifications in a database are sensitive information.
    [1] V. Torra, G. Navarro-Arribas, Integral Privacy, Proc. CANS 2016, LNCS 10052, 661-669.

November 11th (Friday), 2016

(Venue: Room Insikten, Building Portalen, U. of Skövde)

08:30 - 09:00 Coffee
09:00 - 09:50 Invited talk:

  • Alexander Schliep (Gothenburg University), Chair: Alexander Karlsson
    Compressive Omics: Data science for biomedical applications

    The biomedical field has seen an enormous change due to rapid advances in the throughput and cost of experimental instruments, automation, and the expansion of experiment types and modalities. For example, High-Throughput Sequencing (HTS), a technology to unravel genomic sequences on a large scale, is pervasive in clinical and biological applications such as cancer research and basic science, and is expected to gain enormous momentum in future precision medicine applications. As a consequence, the storage, processing and transmission of HTS data poses great challenges for method developers and practitioners. Compressive Omics, the use of compressed and reduced representations of biological data, was identified by the NIH as one of the core techniques for developing methods which can keep up with the increases in data. In contrast to typical uses of compression to reduce storage requirements, the focus is on representations suitable for computational analysis. We will present our work on using compressed and reduced representations for accelerating advanced statistical computations at genome-scale. This includes recent results on fully Bayesian Hidden Markov Models for identifying Copy Number Variants (segmentation of observation sequences) and compressive genomics.
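    As a loose illustration of HMM-based segmentation (the work above uses fully Bayesian HMMs on compressed representations; the sketch below fits a plain maximum-likelihood HMM with the third-party hmmlearn package):

    ```python
    import numpy as np
    from hmmlearn.hmm import GaussianHMM   # pip install hmmlearn

    rng = np.random.default_rng(0)
    # Coverage-like toy signal: normal copy number, a gain, then normal again.
    signal = np.concatenate([rng.normal(2.0, 0.3, 300),
                             rng.normal(3.0, 0.3, 100),
                             rng.normal(2.0, 0.3, 300)]).reshape(-1, 1)

    hmm = GaussianHMM(n_components=2, n_iter=100, random_state=0).fit(signal)
    states = hmm.predict(signal)                     # Viterbi segmentation
    print("boundaries:", np.flatnonzero(np.diff(states)) + 1)  # ~[300, 400]
    ```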
10:00 - 10:30 Coffee
10:30 - 11:50 Session Chair: Juhee Bae
  • Short-term highway traffic prediction using dynamic parameter combinations in k-nearest neighbours with comprehensive understanding of its parameters
    Bin Sun, Wei Cheng, Prashant Goswami, Guohua Bai (BTH)
    Increasing road traffic is nowadays causing more congestion and accidents, which gain increasing attention from the public and authorities due to the severe loss of lives and property. To achieve efficient traffic management and accident detection, reliable and accurate short-term traffic forecasting is necessary. The k-nearest neighbours (KNN) method is widely used for short-term traffic forecasting, but choosing the right parameter values for KNN is problematic due to dynamic traffic characteristics. In this work, we comprehensively analyse the relationship among all three KNN parameters: the number of nearest neighbours, the search step length (lag), and the window size (shift constraint). We observed that optimizing the parameters individually cannot lead to the best parameter values; thus, optimizing the three parameters simultaneously is necessary. We propose a dynamic procedure that uses suitable parameter combinations of KNN to predict traffic flow metrics, adjusting the combinations dynamically according to traffic flow situations. The results show that KNN with the dynamic procedure performs better than benchmark methods.
    [1] Submitted to journal: IET Intelligent Transport Systems, June 2016.
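    A hedged sketch of how the three parameters interact; the static grid search below only illustrates the joint-optimization point, whereas the proposed procedure selects parameter combinations dynamically:

    ```python
    import numpy as np

    def knn_forecast(series, k, lag, window):
        """Predict the next value from the k historical windows most similar
        to the last `window` observations, searched at steps of `lag`."""
        target = series[-window:]
        cands = []
        for s in range(0, len(series) - window - 1, lag):
            dist = np.linalg.norm(series[s:s + window] - target)
            cands.append((dist, series[s + window]))
        cands.sort(key=lambda t: t[0])
        return np.mean([nxt for _, nxt in cands[:k]])

    rng = np.random.default_rng(0)
    # Synthetic daily-periodic flow (96 x 15-minute intervals per day).
    flow = 100 + 30 * np.sin(np.arange(600) * 2 * np.pi / 96) + rng.normal(0, 3, 600)

    best = min((abs(knn_forecast(flow[:-1], k, lag, w) - flow[-1]), (k, lag, w))
               for k in (3, 5, 10) for lag in (1, 2, 4) for w in (4, 8, 12))
    print("best (k, lag, window):", best[1], "error:", round(best[0], 2))
    ```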
  • Market Share Prediction Based on Scenario Analysis Using a Naive Bayes Model
    Shahrooz Abghari, Niklas Lavesson, and Håkan Grahn (BTH)
    In today's competitive markets, such as telecom, companies fight to gain a larger market share. The importance of market share is due to its direct relationship with the profitability and sustainability of a company: the higher the share, the greater a company's chances to achieve the high revenues that can consolidate its position in the market. Designing a successful marketing strategy is a complicated process that always requires a careful and realistic assessment of a company's market position. Moreover, it requires an estimate of future market-size growth based on supply and demand scenarios, together with an estimate of future market shares based on competitors' objectives and strategies. To cope with this complicated process, companies often aggregate different sources of information to analyze different scenarios and predict market shares. For instance, the market share can be predicted by considering the effect of introducing a new product/service to the current market and/or a new market, or the effect of increasing, decreasing, or maintaining the company's share in one specific region. This work presents an ongoing case study from a telecom company and aims at predicting the market share through applying different scenarios. These scenarios range from introducing a new service/product and targeting a new market to analyzing the effect of economic crises on the market share. To achieve this, a decision model based on a naive Bayes model is proposed. The naive Bayes model is a useful tool for reasoning under uncertainty, where experts' knowledge is incomplete and/or ambiguous. The performance of the model will be evaluated with real data, and the validity of the results will be investigated by the experts at the telecom company.
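    A minimal sketch of scenario analysis with a naive Bayes model; the scenario variables, share classes and synthetic training data are illustrative assumptions, not the company's actual setup:

    ```python
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([
        rng.integers(0, 2, n),   # scenario: new product introduced? (0/1)
        rng.integers(0, 2, n),   # scenario: economic crisis? (0/1)
        rng.normal(0, 1, n),     # scenario: relative price position
    ])
    score = 0.8 * X[:, 0] - 1.0 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.5, n)
    y = np.digitize(score, [-0.5, 0.5])  # 0 = share shrinks, 1 = stable, 2 = grows

    nb = GaussianNB().fit(X, y)
    scenario = [[1, 0, -0.3]]  # new product, no crisis, slightly cheaper
    print(nb.predict_proba(scenario).round(2))  # P(shrink), P(stable), P(grow)
    ```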
  • Learning Meaningful Knowledge Representations for Self-Monitoring Applications
    Sławomir Nowaczyk, Sepideh Pashami, Mohamed-Rafik Bouguelia, Antanas Verikas (Halmstad University)
    The ability to learn meaningful and useful data representations is crucial for machine learning applications that need to maintain good performance as they are applied to more and more general problems in more and more complex settings. In this work we present an awareness-based self-monitoring system that learns how to continuously extract useful information from streaming data. In a setting where a group of individuals is available for observation and comparison, new challenges arise with regard to representation learning.
    We focus on the possibility of creating a general meta-framework for the purpose of self-monitoring; more specifically, on detecting an emerging anomaly for an individual system, one that can lead to a failure in the near future. We are interested in finding an appropriate representation of the “normal” behavior or reference, both for an individual and for the whole group, based on parameters such as configuration, task or external conditions. A suitable comparison method then needs to be developed to determine whether a given observation, or a set of observations over time, is sufficiently similar to this reference to be considered normal.
    We have identified several different aspects and challenges in that context. First, one needs to automatically select the most suitable representation paradigms to represent the available data in a compact and expressive way, based on the properties of the data, the details of the task to be solved, and other constraints of the domain. The automatic determination of relevant data representation requires designing models that allow for efficient learning, while being flexible enough to capture different aspects of the data simultaneously and take into account different kinds of initial domain and expert knowledge.
    Second, a common assumption for anomaly detection is that the majority of observations represent the correct behavior of the system. However, in many real-world applications, both complete failures and the optimal functioning are quite rare, and the majority of the data corresponds to a mediocre operation.
    Moreover, patterns of normal behavior become less and less distinct as the number of individuals within the reference group increases, due to the inevitably higher variability. Appropriate groups for self-monitoring are based on domain-specific relevant criteria, and influenced by external factors such as weather, which can be subject to change over time. Distinguishing between usual changes (e.g. seasonal changes) and unusual changes that are anomalous (e.g. indicating a deviation of an individual from a group) poses challenges for the representation learning.
    Finally, the created models should not only be capable of providing end users with descriptive, explanatory analysis, and rich visualization functionality, but should also be capable of taking into account the expert’s feedback in order to improve representations.
  • Accelerating science with big data at Chalmers
    Hans Salomonsson, Oscar Ivarsson, Azam Muhammad Sheikh, Pramod Bangalor (Chalmers)
    The recent revolution in data science has had a profound influence on the research methods in many areas. In order to take full advantage of this revolution, researchers need to develop and adapt to these new research methodologies. But the complexity of the underlying methods, tools and infrastructure can be overwhelming for individuals within a single research group. To address this problem, Chalmers has created a pool of data science experts that work across research groups. This talk will give an overview of a few projects the team has completed to demonstrate the usefulness and variety of big data technologies.
11:50 - 12:00 Closing session

Registration: In order to register for the workshop:
Send a message to sweds16@his.se (deadline for registration: October 20th) with the following information:
  • Email subject: Registration to SweDS 2016
  • Name
  • Affiliation (complete with address)
  • Email address
  • Inform if you will participate both days or only one of them
  • If you are the presenter of any of the papers above, please, include the title of the paper
  • (No registration fee; coffee breaks included)
    SweDS 2016: 4th Swedish National Workshop on Data Science (SweDS)
    Workshops allow members of a community with common interests to meet in the context of a focused and interactive discussion. SweDS-16, the fourth Swedish Workshop on Data Science, brings together researchers, practitioners, and opinion leaders with an interest in data science. The goal is to further establish this important area of research and application in Sweden, foster the exchange of ideas, and promote collaboration. As a follow-up to the very successful previous editions, held at the University of Borås, Stockholm University, and Blekinge Institute of Technology, we plan two full days of inspiring talks, discussion sessions, a student forum, and time for networking. We invite stakeholders from academia, industry, and society to share their thoughts, experiences and needs related to data science. The workshop is organised by the Skövde Artificial Intelligence Lab (SAIL) at the University of Skövde. Contact: Vicenç Torra (SweDS 2016 chair) and the SAIL team at sweds16@his.se.

    Submission: Abstract submission
    We invite academic researchers as well as industrial researchers and practitioners to submit short abstracts (150-500 words) in one of the following categories: (1) original research, (2) new/relevant challenge, (3) status report of ongoing work. Abstracts will be screened and selected based on relevance and quality. The authors of accepted abstracts are given approx. 20-30 minutes to present (incl. questions/discussion).

    Submit your abstracts to sweds16@his.se. The email should include the following:

  • Email subject: SweDS 2016
  • Title of the talk
  • Authors
  • Affiliations
  • Abstract (in plain text in the body of the message, use latex commands if formulas are needed)
  • Reference(s). If the paper has already been published, include full reference of the publication.

    Dates: Important dates
    • Abstract submission deadline: Sep 8, 2016
    • Notification to authors/presenters: Sep 10, 2016
    • Workshop registration: Sep 16, 2016
    • Workshop: Nov 10 - 11, 2016

    Topics: What is data science and why is it relevant?

    Data science focuses on the extraction of knowledge from data. The overall aim is to make better use of the ever increasing amount of data generated by individuals, societies, companies, and science. To achieve this aim, the objectives are to identify relevant challenges and problems, to study, develop and evaluate solutions based on efficiency and effectiveness, and to perform successful implementations. Data science is based on theory and methods from many fields, including: computer vision, data mining & knowledge discovery, machine learning, optimization, statistics, and visualization.

    Data science is not about blindly sifting through data in the hope of interesting results and discoveries. On the contrary, data science requires the ability to make sense of our complex world and the domain under study, and to use this understanding and the available data to develop suitable mathematical models that help us explain and predict interesting phenomena.

    Topics of Interest include, but are not limited to, the following

    Methods and Algorithms

    • Classification, Clustering, and Regression
    • Probabilistic & Statistical Methods
    • Graphical Models
    • Spatial & Temporal Mining
    • Data Stream Mining
    • Feature Extraction, Selection and Dimension Reduction
    • Data Cleaning, Transformation & Preprocessing
    • Multi-Task, Multi-label, and Multi-output Learning
    • Big Data, Scalable & High-Performance Computing Techniques
    • Mining Semi-Structured or Unstructured Data
    • Text & Web Mining
    • Data privacy

    Applications

    • Image Analysis, Restoration, and Search
    • Climate/Ecological/Environmental Science
    • Risk Management and Customer Relationship Management
    • Genomics & Bioinformatics
    • Medicine, Drug Discovery, and Healthcare Management
    • Automation & Process Control
    • Logistics Management and Supply Chain Management
    • Sensor Network Applications and Social Network Analysis


    CFP: Call for papers in plain text (cfp.txt)