This is a real example from a CERN competition hosted on Kaggle (ended Oct 2015). The task is to find $\tau \to 3\mu$ decays (lepton flavor violation, which, if detected, would indicate new physics beyond the SM). Some of the labeled data are artificially generated, but the successful learning models from the competition can be used for real detection. See this technical CERN report.
The 1st place team's solution, published on the Kaggle blog and on GitHub, shows they used "physical intuition + xgboost + ANN". Tree boosting was also used by some winning teams in the previous contest, the search for the Higgs boson (ended Sept 2014); in fact xgboost has been popular in many other contests too. References: "Higgs Boson Discovery with Boosted Trees" and "The Higgs boson machine learning challenge".
This 3-year-old video by David Hogg (NYU) (April 2013) reflects how much things have changed in terms of the adoption of ANNs in the science community. Thanks to TensorFlow (Google) and Torch (NYU, Facebook), ANNs are a lot easier to use now!
from IPython.display import YouTubeVideo
YouTubeVideo("aA3qdegi8Vw", start=564,end=642, autoplay=0,
theme="light", color="white", rel=0,
height=400, width=500,showinfo=0)
'''
My notes on boosting discussed the main difference between tree
bagging (random forest) and tree boosting: essentially one reduces
variance while the other increases accuracy but is prone to
overfitting. We can also combine the two.
The following code was copied from
https://www.kaggle.com/benhamner/flavours-of-physics/
rf-xgboost-example/code
'''
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
print "Train a Random Forest model"
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(train[features], train["signal"])
# train[features] are the feature columns,
# train["signal"] holds the 0/1 labels (background/signal)
print "Train a XGBoost model"
params = {"objective": "binary:logistic",
"eta": 0.3,
"max_depth": 5,
"min_child_weight": 3,
"silent": 1,
"subsample": 0.7,
"colsample_bytree": 0.7,
"seed": 1}
num_trees=250
gbm = xgb.train(params, xgb.DMatrix(train[features],
train["signal"]), num_trees)
#Combining tree boost and random forest
print "Make predictions on the test set"
predictions = (rf.predict_proba(test[features])[:,1] +
gbm.predict(xgb.DMatrix(test[features])))/2
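'''
A quick sanity check one can add here (not part of the original Kaggle script):
hold out a slice of the labeled train set and score the 50/50 blend with ROC AUC.
The 80/20 split and the plain AUC are illustrative assumptions; the competition
itself scored a weighted AUC on a hidden test set.
'''
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

tr, va = train_test_split(train, test_size=0.2, random_state=1)
rf_cv = RandomForestClassifier(n_estimators=100, random_state=1)
rf_cv.fit(tr[features], tr["signal"])
gbm_cv = xgb.train(params, xgb.DMatrix(tr[features], tr["signal"]), num_trees)
blend = (rf_cv.predict_proba(va[features])[:, 1] +
         gbm_cv.predict(xgb.DMatrix(va[features]))) / 2
print "blend AUC on the held-out 20%%: %.4f" % roc_auc_score(va["signal"], blend)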
'''
Walking through the 1st place team's technical write-up on the blog,
we find that they first created a new attribute, new_mass,
which turns out to be highly correlated with the measured mass
in the train set.
The following code was copied from their github solution
https://github.com/aguschin/flavours-of-physics/blob/
master/AssemblingFinalSolution4.ipynb
'''
for tt in train, test, check_agreement, check_correlation:
    tt['p0_pz'] = (tt.p0_p**2 - tt.p0_pt**2)**0.5
    tt['p1_pz'] = (tt.p1_p**2 - tt.p1_pt**2)**0.5
    tt['p2_pz'] = (tt.p2_p**2 - tt.p2_pt**2)**0.5
    tt['pz'] = tt.p0_pz + tt.p1_pz + tt.p2_pz
    tt['p'] = (tt.pt**2 + tt.pz**2)**0.5
    tt['speed'] = tt.FlightDistance / tt.LifeTime
    tt['new_mass'] = tt.p / tt.speed
features_m = features + ['new_mass']
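'''
A quick check of the claim above (not from their notebook): the train set still
carries the measured 'mass' column, so it can be compared directly against the
engineered new_mass.
'''
print "corr(mass, new_mass) = %.4f" % train['mass'].corr(train['new_mass'])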
'''
Cross-validation was used throughout.
The mass column is missing in the test set, so they also used
xgboost to predict the mass (called new_mass2) for the training set
as well, and added two more new features,
'new_mass_delta' and 'new_mass_ratio'.
Then they used nolearn.lasagne for the neural net + parallelism.
'''
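'''
A minimal sketch of that mass-prediction step (assumed shape, not their exact
code): fit an xgboost regressor on the train set's measured 'mass', predict it
as new_mass2 for every data set, and derive the delta/ratio features from
new_mass. The regression parameters and the exact definitions of
new_mass_delta / new_mass_ratio are assumptions here.
'''
mass_params = {"objective": "reg:linear", "eta": 0.3,
               "max_depth": 5, "seed": 1}
mass_gbm = xgb.train(mass_params,
                     xgb.DMatrix(train[features_m], train["mass"]), 200)
for tt in train, test, check_agreement, check_correlation:
    tt['new_mass2'] = mass_gbm.predict(xgb.DMatrix(tt[features_m]))
    tt['new_mass_delta'] = tt['new_mass'] - tt['new_mass2']
    tt['new_mass_ratio'] = tt['new_mass'] / tt['new_mass2']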
import nolearn
from lasagne.layers import DenseLayer, InputLayer, NINLayer
from lasagne.nonlinearities import softmax, rectify, sigmoid
from lasagne.updates import rmsprop, adagrad, sgd, adadelta
from nolearn.lasagne import NeuralNet

# layers0 was not shown in this excerpt; the stack below is the standard
# nolearn layer spec implied by the parameter names that follow
layers0 = [('input', InputLayer),
           ('dense1', DenseLayer),
           ('output', DenseLayer)]

net = NeuralNet(layers=layers0,
                input_shape=(None, Xtrain.shape[1]),
                dense1_num_units=8,           # 1 hidden layer with 8 neurons
                output_num_units=2,           # 2 classes: background / signal
                output_nonlinearity=softmax,  # then a softmax layer
                update=adagrad,
                update_learning_rate=0.01,
                eval_size=.005,
                verbose=1,
                max_epochs=3000)
net.fit(Xtrain, Ytrain)
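'''
Usage after fitting (a sketch; Xtest is assumed to be a float32 array built
from the same feature columns as Xtrain): nolearn's NeuralNet exposes the
usual sklearn-style predict_proba.
'''
nn_probs = net.predict_proba(Xtest)[:, 1]   # P(signal) for each test event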
'''
In summary, from their code we learn that xgboost is about
95-98% accurate with varying numbers of trees,
and the NN is about 99.8% accurate with k-fold CV (k=5);
then they blended everything together.
They didn't show how they got the final blending formula, but
it looked like another NN was used to calculate the weights.
Finally they reached 0.999997749873 accuracy.
'''
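'''
A minimal sketch of the kind of 5-fold check mentioned above (layout assumed,
not their blending code): out-of-fold xgboost predictions on the engineered
feature set, scored with plain accuracy at a 0.5 threshold.
'''
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import accuracy_score
import numpy as np

y = train["signal"].values
oof = np.zeros(len(train))
for tr_idx, va_idx in StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1):
    fold_gbm = xgb.train(params,
                         xgb.DMatrix(train[features_m].iloc[tr_idx], y[tr_idx]),
                         num_trees)
    oof[va_idx] = fold_gbm.predict(xgb.DMatrix(train[features_m].iloc[va_idx]))
print "5-fold out-of-fold accuracy: %.4f" % accuracy_score(y, oof > 0.5)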
'''
One can replace the nolearn.lasagne part of their solution
with TensorFlow logistic (softmax) regression;
it runs in parallel and its accuracy is about 99.2%.
Reference:
www.tensorflow.org/versions/0.6.0/tutorials/mnist/pros/index.html
'''
import tensorflow as tf
import numpy as np

sess = tf.InteractiveSession()

# TensorFlow uses placeholders so data can be fed in at run time:
# each training step receives a mini-batch, and the runtime can
# dispatch the work to different processors in parallel.
x = tf.placeholder(tf.float32, [None, 45])   # 45 = number of features
y_ = tf.placeholder(tf.float32, [None, 2])   # one-hot labels: background / signal
W = tf.Variable(tf.zeros([45, 2]))
b = tf.Variable(tf.zeros([2]))
logits = tf.matmul(x, W) + b
y = tf.nn.softmax(logits)

# Define loss and optimizer, then train
cross = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits, y_))
train_op = tf.train.GradientDescentOptimizer(0.5).minimize(cross)
predict_op = tf.argmax(logits, 1)

sess.run(tf.initialize_all_variables())
for i in range(1000):
    # draw a random mini-batch of 50 rows; Xtrain / Ytrain_onehot are assumed
    # to be float32 feature and one-hot label arrays prepared earlier
    idx = np.random.choice(len(Xtrain), 50, replace=False)
    train_op.run(feed_dict={x: Xtrain[idx], y_: Ytrain_onehot[idx]})
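'''
A quick way to check the ~99.2% figure quoted above (a sketch, not from the
original notes): evaluate predict_op on a held-out split. Xval / Yval_onehot
are assumed to be prepared the same way as Xtrain / Ytrain_onehot
(float32 features, one-hot labels).
'''
correct = tf.equal(predict_op, tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
print "validation accuracy: %.4f" % accuracy.eval(feed_dict={x: Xval, y_: Yval_onehot})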