Links

From Olav Laudy Data Science


Topological Data Analysis and Machine Learning reinforce each other. Good read on how to visualize!

http://www.ayasdi.com/blog/bigdata/how-tda-and-machine-learning-enhance-each-other/

This is how to do large scale data science!

http://www.unofficialgoogledatascience.com/2015/09/causal-attribution-in-era-of-big-time.html?m=1

[must read] Fantastically clear post on vector models and k-NN.

http://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1/

[Python] Great tutorial on word2vec and doc2vec

http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis

Great intro on Random Forests in Python and R

http://www.analyticsvidhya.com/blog/2015/09/random-forest-algorithm-multiple-challenges/

Read this after the post on calculus on computational graphs (linked below).

http://outlace.com/Computational-Graph/

Nicely explained: model based clustering, with examples in R.

http://exploringdatablog.blogspot.com/2011/08/fitting-mixture-distributions-with-r.html?m=1

[must read] Calculus on computational graphs.

http://colah.github.io/posts/2015-08-Backprop/

[R] Simple tutorial on building word clouds

http://datascienceplus.com/building-wordclouds-in-r/

If you always wanted to understand LSTMs, this is your chance!

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Light intro to some concepts of optimization.

http://horicky.blogspot.sg/2015/08/common-techniques-in-optimization.html

[A good read] Controlling for confounding variables.

http://janhove.github.io/design/2015/08/24/caveats-confounds-correlational-designs/

Nice intro to multilevel models (also known as linear mixed models or random-effects models) in R.

http://datascienceplus.com/analysing-longitudinal-data-multilevel-growth-models-i/

Fairly technical paper, but the technique is new and fascinating! This could well be the next generation of models.

http://arxiv.org/abs/1503.03585

Good tips on advancing in data science.

http://machinelearningmastery.com/techniques-to-understand-machine-learning-algorithms-without-the-background-in-mathematics/

Read this extremely clearly written article on Generalized Additive Models (GAMs) and how to fit them in R.

http://multithreaded.stitchfix.com/blog/2015/07/30/gam/

Very nice overview of the basic data mining algorithms with R and Python code.

http://www.analyticsvidhya.com/blog/2015/08/common-machine-learning-algorithms/

[Mind-blowing video] How the brain does backpropagation

https://youtu.be/kxp7eWZa-2M?t=38m13s

Cool article about Bayesian optimization of hyperparameters with Gaussian processes.

http://betatim.github.io/posts/bayesian-hyperparameter-search/

Interesting take on the maturity of the different categories of artificial intelligence.

http://insights.venturescanner.com/2015/08/06/which-artificial-intelligence-category-is-most-mature/

[Just for fun] 5 cool and unusual datasets to play around with

http://www.bytesandstitches.com/blog/5-weird-and-wonderful-data-sets-you-can-use/

Great post on matrix factorization and the relation between k-means and PCA.

http://joelcadwell.blogspot.com/2015/08/matrix-factorization-comes-in-many.html?m=1

Survival analysis tutorial in R.

http://mathminers.com/index.php/2015/08/06/survival-analysis-in-r-step-by-step-guide/

Simple tutorial on deep learning with the Keras framework

http://smerity.com/articles/2015/keras_qa.html

Nice overview of Watson Tradeoff Analytics.

https://developer.ibm.com/watson/blog/2015/07/29/what-makes-watson-tradeoff-analytics-is-so-different/

Great read on recommendation systems and the technicalities behind the Netflix challenge.

http://www.mmds.org/mmds/v2.1/ch09-recsys2.pdf

15 Questions about plots in R.

http://blog.datacamp.com/15-questions-about-r-plots/

Good post on preventing model leakage, illustrated by a cross validation example in Python.

http://www.alfredo.motta.name/cross-validation-done-wrong/

[What a jewel!] If you want to get into deep learning, read this extremely accessible book.

http://neuralnetworksanddeeplearning.com

Great article on word embeddings

http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/

Nice, well-formulated tutorial on the R functions apply, mapply and sapply.

http://blog.datacamp.com/r-tutorial-apply-family/

Nice write-up of a data science competition model

https://medium.com/@nickrgmills/what-i-learned-from-my-first-data-science-competition-a00cadcba52a

Great read on detecting fraud in online games

https://www.snellman.net/blog/archive/2015-07-22-cheater-detection-in-async-online-game/

Just because you can: R and the location of letters in words.

http://www.56n.dk/where-do-letters-occur-in-words/

Some good insights into feature creation using machine learning models.

http://blog.kaggle.com/2015/07/24/taxi-trip-time-winners-interview-3rd-place-bluetaxi/

Humor that only makes data scientists smile.

http://www.oneweirdkerneltrick.com

Free data science courses on the web.

http://enterprise.import.io/post/the-best-free-courses-on-the-web-for-becoming-a-data-science-master/

[R code] Simple example: intro to gradient descent by deriving it for a linear model.

http://alexhwoods.com/2015/07/19/guide-to-linear-regression/
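
For readers who prefer Python to R, the idea can be sketched as follows. This is a minimal illustration, not the linked post's code: the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for the example. It minimizes the mean squared error of the line y = w*x + b by repeatedly stepping against the gradient.

```python
def fit_line(xs, ys, lr=0.05, steps=5000):
    """Fit y = w*x + b by batch gradient descent on the mean squared error.

    lr and steps are illustrative defaults (assumptions), not tuned values.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every data point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2) w.r.t. w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noise-free data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On data this small the closed-form least-squares solution would be simpler; gradient descent is used here only because it generalizes to models without a closed form.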

Well-explained and useful introduction to graph databases, with an application to building a recommendation algorithm.

https://medium.com/@keithwhor/using-graph-theory-to-build-a-simple-recommendation-engine-in-javascript-ec43394b35a3

Stop hiring data scientists until ready!

http://www.kdnuggets.com/2015/07/stop-hiring-data-scientists-until-ready.html#.VakaYm7e5uM.linkedin

A great set of data science tutorials on Git (including an explanation of GitHub)

http://www.analyticsvidhya.com/blog/2015/07/github-special-data-scientists-to-follow-best-tutorials/

Nice illustration of decision boundaries for various machine learning models.

http://freakonometrics.hypotheses.org/20002

The ultimate data science cheatsheet collection

http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html

Very useful tutorial on how to use Git with R.

http://www.r-bloggers.com/rstudio-and-github/

Readable article on deep neural networks for vision and the recent ability for these networks to 'dream'.

http://engineer.abeja.asia/?p=173

Nice small writeup on an R model for a Kaggle competition.

http://www.analyticsvidhya.com/blog/2015/07/top-10-kaggle-fb-recruiting-competition/

Very useful R viz cheatsheet

http://api.ning.com/files/DS3fBDgklopO4ylKMMGD-9Ompid7jrCt0MdE0iB4SXDPopgiaBYQrQF8j-rT0JLN-0hwM8WAYcwm*J83sNCx8YssrIajdoqO/Capture1.PNG

[Technical] Tough read on uncertainty in deep learning models, but well worth it.

http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html

First steps: getting started with SparkR.

http://pingax.com/sparkr-with-rstudio-ubuntu-12-04/

Practical guide to visualize high dimensional data

http://blog.applied.ai/visualising-high-dimensional-data/

Play with this tool to show how deep neural nets 'dream'

https://github.com/google/deepdream/blob/master/dream.ipynb

Fantastic practical insight into model building

http://blog.kaggle.com/2015/05/07/profiling-top-kagglers-kazanovacurrently-2-in-the-world/

Great article about the difference between machine learning and statistical modeling.

http://www.analyticsvidhya.com/blog/2015/07/difference-machine-learning-statistical-modeling/

Important article that discusses how to visualize what a deep neural network learns.

http://arxiv.org/pdf/1506.06579v1.pdf

Nice Kaggle coding walkthrough.

http://www.analyticsvidhya.com/blog/2015/06/solution-kaggle-competition-bike-sharing-demand/

Fantastic to see how neural networks are equipped with episodic memory to give them the power of reasoning.

http://arxiv.org/pdf/1506.07285v1.pdf

Understanding boosting, with nice visualizations

http://www.r-bloggers.com/an-attempt-to-understand-boosting-algorithms/

Great way to explain complexity of an algorithm.

http://algosaur.us

Informative read on a model building journey.

http://www.r-bloggers.com/kdd-cup-2015-the-story-of-how-i-built-hundreds-of-predictive-models-and-got-so-close-yet-so-far-away-from-1st-place/

Quick walkthrough of machine learning models and deep learning.

http://www.slideshare.net/mobile/TerryTaewoongUm/introduction-to-machine-learning-and-deep-learning

[Nerd] Just because it's funny

http://www.cafepress.com/mf/96703427/i-support-vector-machines_tshirt?productId=1502159657

Nice showcase of modeling on Spark: Word2Vec & Gradient Boosting Machines.

http://h2o.ai/blog/2015/06/ask-craig-sparkling-water/

Insightful post on modeling human behavior.

http://joelcadwell.blogspot.com/2015/06/looking-for-preference-in-all-wrong.html?m=1

This is how Facebook knows who you are, even without seeing your face.

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Zhang_Beyond_Frontal_Faces_2015_CVPR_paper.pdf

Inspiring and insightful interview with Top Kaggler

http://blog.kaggle.com/2015/06/22/profiling-top-kagglers-owen-zhang-currently-1-in-the-world/

Large-scale flash memory failures: a good read on analytics at work to understand the life cycle of hardware components. What is missing is the forward-looking part: can you see how to include that?

http://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf

If you have nothing better to do today, analyze this terabyte dataset with your favorite Click Through Rate models.

http://labs.criteo.com/downloads/download-terabyte-click-logs/

Read up on ROC and AUC, with an application to predicting deaths on the Titanic (so useful).

http://www.r-bloggers.com/illustrated-guide-to-roc-and-auc/

Interesting read on hyper-parameter optimization.

https://medium.com/@D33B/smarter-parameter-sweeps-or-why-grid-search-is-plain-stupid-c17d97a0e881

The never ending possibilities of the neural network: teaching the computer to have conversations.

http://arxiv.org/pdf/1506.05869v1.pdf

Simple (technical) read on Random Forests

http://www.r-bloggers.com/variable-importance-plot-and-variable-selection/

Brilliant article on R in the IBM Cloud.

http://www.ibm.com/developerworks/library/ba-bluemix-trs-predictive-analytics-with-dashdb/index.html

Cool automation in R

http://www.r-bloggers.com/connecting-r-to-everything-with-ifttt/

This is how machine translation works. Cool stuff!

http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/

[YouTube+code] Neural network evolves to play Super Mario World.

https://www.youtube.com/watch?v=qv6UVOQ0F44

Fascinating result! Machine learning method beats humans on the verbal comprehension questions of an IQ test. Technical paper.

http://arxiv.org/pdf/1505.07909v1

Mapping example in R, good example code, with an application to crime analytics.

http://www.r-bloggers.com/introductory-point-pattern-analysis-of-open-crime-data-in-london/

A must read on model mixing! Well written, with lots of examples, and an overview you will not find collected like this in the literature.

http://mlwave.com/kaggle-ensembling-guide/

Basic R: getting familiar with data frames. An easy to follow and well illustrated tutorial.

http://www.r-bloggers.com/15-easy-solutions-to-your-data-frame-problems-in-r/

Interesting, non-technical read on a recommendation system for an online retailer.

http://www.www2015.it/documents/proceedings/companion/p1269.pdf

How to become a data scientist: a nice guide with lots of detail.

http://www.mastersindatascience.org/careers/data-scientist/

This makes me smile: logistic regression to find out the value of chess pieces.

http://www.r-bloggers.com/big-data-and-chess-what-are-the-predictive-point-values-of-chess-pieces/

Clearly written article on A/B testing and proving your analytical model by setting up an experiment.

http://blog.dato.com/how-to-evaluate-machine-learning-models-the-pitfalls-of-ab-testing

I love the thinking! This is what we need to do more in data science.

http://singularityhub.com/2015/06/08/how-to-disrupt-yourelf-with-moonshot-thinking-and-unholy-alliances/

Insightful paper on characteristics of fraud that are detectable in data by using analytics.

http://info.neo4j.com/rs/neotechnology/images/Fraud%20Detection%20Using%20GraphDB%20-%202014.pdf

Good hints: speeding up your R code

http://rstatistics.net/strategies-to-speed-up-r-code/

Tuning the parameters of your Random Forest model

http://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

Simple introduction to text mining: bag of words and term frequency / inverse document frequency (TF-IDF)

http://fastml.com/classifying-text-with-bag-of-words-a-tutorial/
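
As a rough illustration of the two ideas, here is a small pure-Python sketch. It is not the tutorial's code: the function name `tf_idf` and the three example sentences are hypothetical, and it uses the plain unsmoothed weighting tf * log(N / df), whereas libraries usually apply smoothed variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Weight = term frequency * log(N / document frequency), an unsmoothed
    textbook variant chosen for illustration (an assumption).
    """
    # Bag of words: lowercase, split on whitespace, ignore word order.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
docs = [
    "the movie was great",
    "the movie was terrible",
    "the plot was great fun",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", max(w, key=w.get))
```

Words that occur in every document (here "the" and "was") get weight zero, while words unique to one document get the highest weight.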

A useful pointer: lessons learned in high-performance R.

http://www.r-bloggers.com/lessons-learned-in-high-performance-r/

Awesome paper from Google about the prediction of energy efficiency in their data centers. Well written, and includes some examples of how the predictions can be used to make the data center more efficient.

https://docs.google.com/a/google.com/viewer?url=www.google.com/about/datacenters/efficiency/internal/assets/machine-learning-applicationsfor-datacenter-optimization-finalv2.pdf

Long, but worth the read: Hofstadter, the author of Gödel, Escher, Bach, on intelligence, AI and machine learning.

http://hardforkit.com/articles/the-man-who.html

Simple introduction to the top 10 data mining algorithms. The real trick is to start using them :)

http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english

Excellent series on the workings of various machine learning models by understanding their decision boundaries, shown in simple R code.

http://tjo-en.hatenablog.com/entry/2015/03/11/200000
http://tjo-en.hatenablog.com/entry/2015/03/20/191614
http://tjo-en.hatenablog.com/entry/2015/04/02/190000
http://tjo-en.hatenablog.com/entry/2015/04/20/190000
http://tjo-en....more

AirBnb rocks! Check out the nerd section on their home-grown modeling tool (and check out the article on handling missing data in Random Forests).

http://nerds.airbnb.com/airflow/
http://nerds.airbnb.com/overcoming-missing-values-in-a-rfc/

Simple Random Forest explanation and coding example.

http://tjo-en.hatenablog.com/entry/2015/06/04/190000

An overlooked area in Machine Learning: prediction intervals

http://blog.datadive.net/prediction-intervals-for-random-forests/

Great read to expand your intuition on high dimensional spaces

http://isomorphism.es/post/120539470124/hacks-for-thinking-about-high-dimensional-space

Insightful tips to improve your model

https://medium.com/@D33B/7-ways-to-improve-your-predictive-models-753705eba3d6

Nice overview of the different data scientist skills.

http://dataconomy.com/the-22-skills-of-a-data-scientist/

How to keep your data scientists: I like the first point (the other points make sense too)

http://channels.theinnovationenterprise.com/articles/how-to-keep-your-data-scientists

Great article on how unstructured data became important to data science

http://www.hadoop360.com/blog/how-nosql-fundamentally-changed-machine-learning

Thought-provoking examples of how different datasets give rise to the same regression equation

http://en.m.wikipedia.org/wiki/Anscombe%27s_quartet
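
The claim is easy to check by hand. The sketch below fits an ordinary least-squares line to each of Anscombe's four published datasets using only the standard library; the helper name `least_squares` is an assumption made for this example.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Anscombe's quartet: datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

fits = [least_squares(xs, ys) for xs, ys in quartet]
for i, (intercept, slope) in enumerate(fits, start=1):
    print(f"dataset {i}: y = {intercept:.2f} + {slope:.3f} x")
```

All four fits come out at roughly y = 3 + 0.5x, even though scatter plots of the four datasets look nothing alike, which is why plotting the data before modeling matters.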

Good post on interpreting categorical regression coefficients

http://www.r-bloggers.com/using-and-interpreting-different-contrasts-in-linear-models-in-r/

Interesting post on A/B testing and the need for statistically sound criteria

http://kadavy.net/blog/posts/aa-testing/

Questions from Data Science interviews

http://blog.udacity.com/2015/04/data-science-interview-questions.html

A simple post on model evaluation. The picture is especially useful for explaining this process to the business

http://www.datasciencecentral.com/m/blogpost?id=6448529:BlogPost:275475

A great post on handling Twitter responses using R

http://www.r-bloggers.com/analyzing-r-bloggers-posts-via-twitter/

A good showcase for meta-modeling: a three-layer model is built from many model ensembles

https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/14335/1st-place-winner-solution-gilberto-titericz-stanislav-Semenov

How to filter out relevant predictors for your model

https://www.knime.org/blog/seven-techniques-for-data-dimensionality-reduction

Practical article on K-means clustering

http://bicorner.com/2015/05/26/k-means-clustering-using-r/