ECML 2019 Notes

October 4, 2019 17 minute read

In September 2019 I had the good fortune to attend ECML 2019 in beautiful Wurzburg, Germany.

I was there to present our paper

Bhalla, S. et al., 2019. Compact Representation of a Multi-dimensional Combustion Manifold Using Deep Neural Networks. In European Conference on Machine Learning. Wurzburg, Germany, p. 8. read more pdf

It was a fantastic conference, a nice size and a real diversity of topics, not just a whole bunch of permutations on Deep Learning. What follows are some notes I took at talks which I attended, you can take a look at the full schedule here.

Monday Talks

SridharMahadevan Tutorial

he’s now at AdobeResearch
can also see his slides for IJCAI which was longer than this one

Imagination and Causation

ImaginationScience
- he wants to do reasoning about future paths and outcomes without having data to back it up
- how does he distinguish is from counterfactual reasoning it is just one subset part of it
- also includes: analogy, abstraction, spatial/temporal projection
- what is Combinatorial Creativity? mixing and matching existing compoenets that never occur, like a sphinx
example: long term predictions for climate change
- not just about predicing the next step but about reasoning baout what society will do for the next few decades
example: search engine
- what is the next step?
- “Improbable” company essentially is trying to simulate reality
he thinks GANs are really only novel thing to come out of Deep learning recently
- he doesn’t think ML will be fundamentally different in 20 years, it’s solved essentially
he’s talking about machines creating art
- can machines ever create art?

Pearl’s Ladder of Causality

This is that new three layered approach by JudeaPearl
“The Book of Why” where he explains it in simple approach
Association
observational machine learning
Intervention
Experimental science
Pearl: probabilities are an epiphenomanon resulting from the causal nature of the world

Counterfactuals

Can be thought of as necessary for Imagination

Combining Observation and Causality

GANs - solve problem is visualize imagination
- he says GANs are related to Actor-Critic RL problem
- Actor-Critic models converge but it wasn’t known until 20 years later
- Wasserstein GANs have more theoretical bassis
Seeing GANs as an optimization problem
- min_G max_D V(D,G)

Art and Imagination

CANs are a new model for simulating creativity
Can create fairly nice modern art paintings, but is it art or it sampling a space of images?
- Is there a difference?
- Does it mean anything? does it need to?

Relevant Optimization methods from Economics which would be useful for GANs

Optimization vs Equilibration
- minimize a function in feasible set
- usually need to assume f(x) is differentiable
Equilibration from physics (Stampacchia in the 1960s)
- they define the set of partial gradients for a function as a ‘vector field’
- assume this field is given rather than f(x) being given
- means we can solve optimization problems but also other problems
- eg. there are some vector fields that don’t have a function with a gradient that generates the vector field
- economists use this for domains where the vector field can be written down easily but no simple function leads to it via derivatives
Variational inequality
- eg. traffic management
- GANs are this kind of problem whihc is why they work
- he thinks game theory, GANs, traffic, RL etc are better solved using this approach
GANs don’t converge if you just do gradient descent
- but if you use the Wasserstein loss function then get a nash equilibrium?
- the surface for a GAN is a saddle so gradient descent is bad, very unlikely to lead to a optimal appoint, you will cycle around
- a better idea sometimes is to move orthogonal to the gradient because the think you are trying to do is find an equilibrium point
Extragradient Method
- Frobenius projection from the gradient
- Project back to the main function if the negative vector field leads you out of the feasible set
Mirror Descent instead of Gradient Descent - by Nemirovsky and Yudin
- gradients are in the dual of the original space, and this happens to line up in euclidean but wouldnt work others, so you can’t add them relaly
- essentially you convert the state vector to the dual space first, then compute the gradient, update it and project back to the original space via a conjugate function
- in Euclidean space, this whole mechanism collapses to gradient descent because the conversion is identity
- this explains why multiplicative updates of gradients work better in many spaces
- you can explain many ML methods using this idea
  - boosting
  - natural gradients - gradient descent should be done in reimanian space using the Fischer information matrix
    - Kakade’s paper on Natural Actor-Critic for playing Tetris
Mirror-prox
- take two gradient steps in the dual space, works even better
He calls the dual space the imagination space because it handles (something)
Reinforcement Learning as a type of Causal Learning
where you can try out particular actions
imagination values
- what is the value of deviating from the action the policy advices
- counterfactural kind of value function
- not needed for an MDP but for POMDPs yuo need to imagine the possible worlds
- even for multi-armed bandits it can help, because it leads you outside the existing policy actions
- is it just exploration? it is related to off-policy exploration, how?

Tuesday Talks

Active Learning Anomaly Detection

Unsupervised and Active Learning using Maximin-based Anomaly Detection
Zahra Ghafoori (University of Melbourne), James C. Bezdek (University of Melbourne), Christopher Leckie (University of Melbourne), Shanika Karunasekera (University of Melbourne)
took some pictures of the AD background description which was very good
- semi-supervised which an oracle that can be queried about whether points are anomalies or not
they do active learning and compare to standard AD methods like OCSVM and iForest
better quality than iforest and faster because it only builds one model

OneClass Anomaly Detection

The Elliptical Basis Function Data Descriptor (EBFDD) Network - A One-Class Classification Approach to Anomaly Detection
MehranBazargani (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin), Brian Mac Namee (The Insight Centre for Data Analytics, School of Computer Science, University College Dublin)
a cost functiont hat turns RBF networks in to a one class classifier
assumptions
- they are interested in streaming data
- training data does not include any anomalies
RBF Networks
- overview
- three layers
  - input data
  - hidden layer of gaussian kernals, initialize the (m,v) with k-means
  - output layer lineraly combies them via a sigmoid
- backprop learns the paramters of the gaussians
- not applicable to one class
their change
- main idea: try to tightly fit the gaussians around the subspace of the normal points
- introduce elliptical kernals instead of spherical
- the ellipses can be stretched and shaped to adjust the amount of correlation (or not) between the dimensions
- in stead of gaussian (m,v) they have cov matrix for each kernel
expeirments
- they use the emmot and dietterich paper as their guide
  Autoencoder Anomaly Detection
Robust Anomaly Detection in Images using Adversarial Autoencoders
LauraBeggel (Bosch Center for Artificial Intelligence, Renningen; Ludwig-Maximilians-University Munich), Michael Pfeiffer (Bosch Center for Artificial Intelligence, Renningen), Bernd Bischl (Ludwig-Maximilians-University Munich)
They use an AutoEnc with an discriminator network instead of using KL divergence
neat idea : a point is anomalous if either of the following are true
- point is in a dense region and has high reconstruction error compared to the data from training
- point is a lower density with respect to the training distirbution
- this gets you botht he desnity based and deteailed clasified based approaches, couldn’t any method use this?

Counterfactual Justification

Unjustified Classification Regions and Counterfactual Explanations In Machine Learning
Thibault Laugel (Sorbonne Université), Marie-Jeanne Lesot (Sorbonne Université), Christophe Marsala (Sorbonne Université), Xavier Renard (AXA, Paris), Marcin Detyniecki (Sorbonne Université; AXA, Paris; Polish Academy of Science)
they are trying to infer counterfactual explanations but being careful not to create ones wchih are not jsutified by the data
Idea : can you use decision trees to do coutnerfactual search for other ways to get the result you found?
- some data point goes down the tree to a specific leaf, but then you can infer wether taht is right, maybe it should have gone sligtly elehwere
- does this imply the tree should be different or that the leaf’s label is wrong?

Ranking

Fast and Parallelizable Ranking with Outliers from Pairwise Comparisons
Sungjin Im (University of California), Mahshid Montazer Qaem (University of California)
problem: given multiple partial ordering o felemnts/daatpoints, goal is to create a full, single consistent one
cycles are a problem, standard methods are well understand from an algorithmic sense, it is NP Hard
usually comes down to counting and minimizing the number of baward arcs from a DAg
- is it like finding the maximial DAG in a graph?
  Hierarchical Dense anomalies
CatchCore: Catching Hierarchical Dense Subtensor
Wenjie Feng (CAS Key Laboratory of Network Data Science & Technology, Institute of Computing Technology, University of Chinese Academy of Sciences),

Causal Effects Talk

Adjustment Criteria for Recovering Causal Effects from Missing Data
Mojdeh Saadati (Iowa State University), JinTian (Iowa State University)
Pearl backdoor condition, or ignorablity
- allows us to in certain cases, infer a causal effect from observational data
but what if there is msising data? or you are concerned about selection bias affecting the outcomes
they produce two criteria for evaluating only using the causal graph whether the backdoor criterion can be used

Uplift Regression

Shrinkage Estimators for Uplift Regression
Krzysztof Rudaś (Warsaw University of Technology; Institute of Computer Science, Polish Academy of Sciences), Szymon Jaroszewicz (Institute of Computer Science, Polish Academy of Sciences)
exmaple: marketing sends out a discount email
you get new data about purchases after the discount, but do any change arise because of the discounts (uplift) or in spite of them?
Impact could even be reversed, they buy less because of the dicsount coupon
they improve upon the standard double regression approach for this

Wednesday Talks

Fast Gradient Boosting

Fast Gradient Boosting Decision Trees with Bit-Level Data Structures
Laurens Devos (KU Leuven), Wannes Meert (KU Leuven), Jesse Davis (KU Leuven)
XGBoost is the most popular now
also LightGBM, look that up
idea - using full ints and floats is too detailed they use bitlevel datascturcures
existing gradient updates on decision trees Fast Gradient Boosting
- move some datapoints from one leaf to a neighbouring leaf ndoe (logically, it might not be asimple sibling)
BitRoost algorithm (theirs)
- bit representation for each leaf? with one hot dncoding
- then use the fast and/or/counts ability of bits during FGB
read this one in more detail, looks like very useful approach

Association Rules

Sets of Robust Rules, and How to Find Them
Jonas Fischer (Max Planck Institute for Informatics; Saarland University), Jilles Vreeken (CISPA Helmholtz Center for Information Security)
still useful in biology
people kept working on Apriori algorithm and found you get a lot of that for free by mining conjuctions
they are still liked becuase you get clean interpretable models andrules
problem is you get millions of rules
Grab algorithm
- heuristics to reduce the rule space
- so it seems like a smarter, more general version of Apriori algorithm

Black Box Explanation

Black Box Explanation by Learning Image Exemplars in the Latent Feature Space
Riccardo Guidotti (ISTI-CNR, Pisa), Anna Monreale (University of Pisa), Stan Matwin (Dalhousie University; Polish Academy of Sciences), Dino Pedreschi (University of Pisa)
Problems if you don’t explain
- The Husky-Wolf classifier problem - it was very good but used snow in the background for all wolf images
current explanation approaches in image clasifiers
- saliency map, showing which pixels leading to which labels
- DeepExplain and other gradient based approaches, visualize the gradient, but it is very specific to taht trained network
- prototype based methods - just show similar types of images that show you what the model is ‘thinking about’
ABELE their method

TD Actor Critic

TD-Regularized Actor-Critic Methods
SimoneParisi(presented), Voot Tangkaratt, Jan Peters, EmtiyazKhan
They are trying to deal with the instability of actor-critic methods when there is little data
The problem: using the critic in AC reduces the variance of simple gradient descent
- but the critic introduces bias and so it often overshoots good parts of the space
their approach (TDRPG) - instead of trying to fix the actor or the critic estimates, they focus on the way they interact and stabilize that
- they add a squared regularization penalty to the optimization step
- they can add their regulazer to the training for any actor critic approach

Deep Ordinal Reinforcement Learning

AlexanderZap (TU Darmstadt), Tobias Joppen (TU Darmstadt), Johannes Fürnkranz (TU Darmstadt)
problem: numerical rewards are arbitrary, changing the values can lead to completely different optimal policies
solution: use orginal rewards instead,
- map numerical rewards to a dinstinct ordered set
- scale doesn’t matter anymore, just order
issues:
- how to accumulate rewards? you don’t add them, you maintain a histogram/prob distribution of getting each reward in the state
- value function uses this prob distribtuion to choose actiopn and return the associated ordinal value
- how do you maximize it?
- sometimes you need to know the relative difference between values and this isn’t appropriate
- very interesting idea

Attentive Multi-Task Deep Reinforcement Learning

label: ECML2019,RL, skill learning

Timo Bräm (ETH Zurich), Gino Brunner (ETH Zurich)
Problem - Multi-task learning is hard
Existing approaches
- transfer learning from agents experts at prior tasks - but might forget old skills
- robust multitask RL (dilstation, teh, 2017) similar to what we are doing
Their approach - using attention…
- train on all environments at once, but maintain explicit submodules for each task during training
- can also enforce a smaller number of submodules than ther are tasks
- this forces generalization just like in autoencoders
There are multiple subnetworks to learn with but they are not explicitly designated to be for a specific task
Attention network is used to decide which subnetwork is the most relevant right now
question - how do you know it isn’t just more paramters that are helping?
experiments : grid world
criticism -
- just 10 random seeds
- domain very simple
- are they changing the rewards along the way?
- still needs lots of work

Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

label: ECML2019,RL

DenisSteckelmacher (Vrije Universiteit Brussel), Hélène Plisnier (Vrije Universiteit Brussel), Diederik M. Roijers (VU Amsterdam), Ann Nowé (Vrije Universiteit Brussel)
Problem: how to learn a policy with a small number of samples without having a model
They introduce some ways to speed up learning with policy gradients to reuse prior trajectories like in clipped DQN

Policy Prediction Network: Model-Free Behavior Policy with Model-Based Learning in Continuous Action Space

label: ECML2019,RL

ZacWellmer (Hong Kong University of Science), James T. Kwok (Technology)
policy gradients, implicit RL
problem: want to do model-free, model-based combination for policy gradients
PPN : many changes and tricks compined together
use similar approach to PPO
- clipping, latent space transition models

Safe Policy Improvement with Soft Baseline Bootstrapping

label: ECML2019,RL, safeAI

KimiaNadjahi (Télécom Paris), Romain Laroche (Microsoft Research Montréal), Rémi Tachet des Combes (Microsoft Research Montréal)
to build a model of the uncertainty you can try to do MLE on the transition modle learning
problem : how to train your RL but focussed on safe outcomes rather than highest performance in a simulation
- This is defined as performing at least as well as the baseline with high probability.
existing approaches
- SOMEMETHOD -
- SPIBB [Laroche2019, ICML] - if you are not sure enough then you don’t take the action, just use the baseline instead
  - problem is they have a very binary approach to something being safe or not
Their Approach
- SPIBB but it’s soft, safe or not is not a binary choice
- They allow an error-budget to control how much error they can safely allow for that domain.

Thursday

Stochastic Activation Actor Critic Methods

Wenling Shang (University of Amsterdam-Bosch-Deltalab), Herke van Hoof (University of Amsterdam-Bosch-Deltalab), Max Welling (University of Amsterdam-Bosch-Deltalab)
Pertubation of intenral network weights to encourage exploration is well established.
but it doesn’t seem to work so well in Actor-Critic Deep RL
they add noise in clever ways to LSTMs to make it work

Compatible Natural Gradient Policy Search

Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, Gerhard Neumann
Trust region policy search - using greedy updates the entropy drops too fast
Their approach
- hard entropy constraint
- something elseq

Stochastic One-Sided Full-Information Bandit

Haoyu Zhao (Tsinghua University), Wei Chen (Microsoft Research, Beijing)
problem: repeated second price auctions
prior work:
- SODA assumes bidders are truthful, and that distritbuion of bidders if iid
- max bid does not need to be known
their approach:
- iid bidders not required
- need to know the max value of the bids
- maintain a set of all good arms : determined by empircal mean

Practical Open-Loop Optimistic Planning

Edouard Leurent (SequeL team, INRIA Lille - Nord Europe; Renault Group), Odalric-Ambrym Maillard (SequeL team, INRIA Lille - Nord Europe)
highway-env environmenta on github for highway driving of human behaviours
assume a generative model
optimisitic planning - they use this along with tree search
they recall people have found out out failing cases of UCT
- very deep branches where all the choices are bad, the optimisitc bias leads to problems
solution (lots of Munos work in 2008,2010) works for restricted classes of MDPs and works
OLOP showed that you can do UCB style planning for stochastic and deterministic MDPs the same
- but htere was no empirical vbalidation of it, so this work does that
- they show it is actually quite difference in pracic
They improve this by using the KL divergneece with OLOP

An Engineered Empirical Bernstein Bound

MarkBurgess (Australian National University), Archie C. Chapman (University of Sydney), Paul Scott (Australian National University)
Hoefding bound - it’s a concentration inequality
Empircal Bernstein bound is similar - uses sample vairnces
Bennett’s inequality is much stronger and useful, but maybe hard to use? this would give you the perfect bound possible if you knew the full variance?
they define their own EBB bound, pretty complex formulation
they show it provides a tighter bound thatn hoeffding on bernoulli bandits

MACLEAN Earth Observation Workshop

Earth Orientation Parameters Time Series

G. Okhotnikov and N. Golyandina: EOP Time Series Prediction Using Singular Spectrum Analysis
The EOP data include 5 numbers that arrive as a time series
- pole coordinates
- lenght of day
- changes in pole angles over time?
There is a service that publishes the daily values for the time series
important for navigation and satellites
SSA - https://en.wikipedia.org/wiki/Singular_spectrum_analysis is a common method used to seperate out trend, noise etc from a time series
- time series of lenght L
- embedding, create a trajectory matrix
- take SVD
- Group eigen triples together
- diagononal averaging
- then reverse the embedding
uses: allows you to extract the orignal sine wave if there was one

MvMF Loss for Prediction locations of an image on Earth’s surface

– M. Izbicki, E. Papalexakis and V. Tsotras: The MvMF Loss for Predicting Locations on the Earth’s Surface

problem how to locate the location of an image in the world just by looking at it
- easy and hard problems : eiffel tower vs inside of an apple store
data - data base of fflickr images with gps in them
their approach - Mixture of von Mises-Fisher Distritbuion
- vMF is like Gaussian distribution for spheres - (m,s)
- MvMF is a mixture of these just like a mixture of gaussians but it knows about sphere structure
- works better for anything about locating predictions on the earth but it assumes the earth is a sphere
so using MvMF as the loss function in one of these algorithm, such as Google PlaNet [Weyland 2019], makes the predcitions much smoother on the earth’s surface
other uses
- estimate range of animals and birds from people’s social media images

Deep learning for power line inspection

Invited Speaker: Prof. Robert Jenssen
25 person machine learning group at UIT in Northern Norway
they are 70 degree north, very weak magentic field there so they get the strongest northern lights
Northern Lights Deep Learning Workshop 2020 (January 20-21), about 100 people
problem: power line inspection using deep learning
- how to use drones to inspect poles in remote regions to assess if maintenance work is needed
method: few shot learning? YOLO.
other applications
- power lines
- piplines
- roads, railways
simulated data
- they generate synthetic of landscapes with power lines, high resoltuion
- tarin the detector and predictor on this
- then use the trained model for real drone flight
  Few Shot Learning
it is common for this type of domain to find new objects that were never enouctered before
ZhenCVPR2018 - Ring Loss for convex feature normalziation for face revognition
- FewShotLearning usually uses protoype points based on few data poiints
- They think this ring loss approach produces a more natural representation
- but it has some difficulat input paramters
Their approach
- their paper on improving few shot learning
- they use a class-conditional dissimilarity measure
- - they want to mesure distance between them base on the angle of the norms?
- result - goal is that all points with hte same class have the same norm
- interesting - they use different scores for distances between points in the same class and ones that are in different classes
- so it’s kind of like a normalization

Twitter LinkedIn

Computationally Thinking