This is a partial list of the machine learning tools available as of July 2016. I have grouped them into open source tools and commercial tools. The split is loose: open source tools usually have a commercial version with support, and commercial tools tend to include a free version that you can download and try. Click the product links to learn more.
Open Source
 MLlib is Apache Spark’s scalable machine learning library.
 Initial contribution from AMPLab, UC Berkeley
 Shipped with Spark since version 0.8
 Over 30 contributors
 Includes many common machine learning and statistical algorithms
 Supports Scala, Java and Python programming languages
 Pros
 Powerful processing performance of Spark. (Spark advertises up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk.)
 Runs on Hadoop, Mesos, or standalone.
 Easy to code. (with Scala)
 Cons
 Spark requires experienced engineers.
 Online Resources http://spark.apache.org/mllib/
 Algorithm
 –Basic Statistics
 Summary, Correlation, Sampling, Hypothesis testing, and random data generation.
–Classification and regression
 linear regression with L1, L2, and elastic-net regularization
 logistic regression and linear support vector machine (SVM)
 decision tree, naive Bayes, random forest, and gradient-boosted trees
 isotonic regression
–Collaborative filtering/recommendation
 alternating least squares (ALS)
–Clustering
 k-means, bisecting k-means, Gaussian mixtures (GMM),
 power iteration clustering, and latent Dirichlet allocation (LDA)
–Dimensionality reduction
 singular value decomposition (SVD) and QR decomposition
 principal component analysis (PCA)
–Frequent pattern mining
 FP-growth, association rules, and PrefixSpan
–feature extraction and transformations
–Optimization
 limited-memory BFGS (L-BFGS)
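To make one of the clustering entries above concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). It is not MLlib's API or implementation; the function name `kmeans`, its parameters, and the toy points are assumptions made for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update steps until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = updated
    return centroids


# Toy data: two well-separated groups; the first two points seed the centroids.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = kmeans(points, centroids=points[:2])
print(sorted(centers))
```

A distributed library such as MLlib runs the same two steps, but spreads the assignment step across the cluster and typically uses a smarter initialization than "first k points".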
scikit-learn is a Python module for machine learning
 built on top of SciPy
 Open source, commercially usable – BSD license
 Started in 2007 as a Google Summer of Code project.
 Built on NumPy, SciPy, and matplotlib
Git: https://github.com/scikit-learn/scikit-learn.git
 Algorithms
 classification: SVM, nearest neighbors, random forest, logistic regression
 regression: support vector regression (SVR), ridge regression, Lasso
 clustering: k-means, spectral clustering, …
 decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …
 model selection: grid search, cross-validation, metrics
 preprocessing: preprocessing, feature extraction
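The model selection entry in the list above combines two ideas: try every value in a parameter grid, and score each value by k-fold cross-validation. The sketch below shows that idea in plain Python and does not use scikit-learn's own API. The helper names (`k_fold_indices`, `fit_ridge_1d`, `cv_mse`), the one-dimensional ridge model, and the toy data are all assumptions chosen to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous validation folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_ridge_1d(xs, ys, alpha):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def cv_mse(xs, ys, alpha, k=5):
    """Mean validation MSE over k folds for one value of alpha."""
    fold_errors = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, alpha)
        fold_errors.append(
            sum((ys[i] - w * xs[i]) ** 2 for i in val_idx) / len(val_idx))
    return sum(fold_errors) / len(fold_errors)


# Noiseless toy data (y = 2x), so any regularization can only hurt.
xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x for x in xs]

# Grid search: score every candidate and keep the one with the lowest CV error.
grid = [0.0, 0.1, 1.0, 10.0]
scores = {alpha: cv_mse(xs, ys, alpha) for alpha in grid}
best_alpha = min(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```

On real, noisy data the best penalty is usually not zero; the noiseless data here just makes the result easy to check by hand.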
 H2O is open-source software for big-data analysis.
 Built by the startup H2O.ai, founded in 2011 in Silicon Valley.
 Users can throw models at data to find usable information, allowing H2O to discover patterns.
 Provides data structures and methods suitable for big data.
 Works with cloud platforms, Hadoop, and all major operating systems.
 Written in Java; supports Java, Python, and R.
 Graphical interface works with all browsers.
 Website: http://www.h2o.ai
 pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
 Python is good for data munging and preparation; pandas helps with data analysis and modeling.
 Works well when combined with the IPython toolkit.
 Good for linear and panel regression. Other models can be found in scikit-learn.
 Open source machine learning library developed by Google and used in many Google products, such as Google Translate, Google Maps, and Gmail.
 Uses data flow graphs for numeric computation. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
 Extensive built-in support for deep learning
 It is only a library. It does not include the trained models or the algorithms used in Google products.
 Cloud offering – Google Cloud ML
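The data flow graph model described in the list above can be sketched in a few lines of plain Python. This is not the library's API: the names `Node`, `constant`, `add`, `multiply`, and `run` are hypothetical, and plain lists stand in for tensors. The point is only that building the graph and running it are separate steps.

```python
class Node:
    """A graph vertex: an operation plus the upstream nodes that feed it."""

    def __init__(self, op, inputs=()):
        self.op = op                 # callable applied to the input values
        self.inputs = tuple(inputs)  # incoming edges (each carries an array)


def constant(values):
    """Source node with no inputs that emits a fixed 1-D array."""
    return Node(lambda: list(values))


def add(a, b):
    """Element-wise addition node."""
    return Node(lambda x, y: [p + q for p, q in zip(x, y)], (a, b))


def multiply(a, b):
    """Element-wise multiplication node."""
    return Node(lambda x, y: [p * q for p, q in zip(x, y)], (a, b))


def run(node, cache=None):
    """Evaluate a node by first evaluating everything upstream of it."""
    cache = {} if cache is None else cache
    if node not in cache:
        args = [run(parent, cache) for parent in node.inputs]
        cache[node] = node.op(*args)
    return cache[node]


# Build the graph for (a + b) * c. Nothing is computed at this point.
a = constant([1.0, 2.0])
b = constant([2.0, 3.0])
c = constant([3.0, 4.0])
graph = multiply(add(a, b), c)

# Running the graph pushes arrays along the edges and applies each operation.
result = run(graph)
print(result)
```

Because the whole computation is known before it runs, a real engine can optimize the graph, differentiate it, or place different nodes on different devices.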
 R is a free software environment for statistical computing and graphics.
 Pros
 Open source and enterprise ready with RStudio.
 Huge ecosystem, lots of libraries and packages.
 Runs on all major operating systems and reads files in many formats.
 Cons
 Algorithm implementations vary between packages, so results can differ.
 Memory management is weak. Performance worsens as data grows.
 Most used R ML Packages

 e1071 Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier
 rpart Recursive Partitioning and Regression Trees.
 igraph A collection of network analysis tools.
 nnet Feed-forward Neural Networks and Multinomial Log-Linear Models.
 randomForest Breiman and Cutler’s random forests for classification and regression.
 caret package (short for Classification And Regression Training)
 glmnet Lasso and elastic-net regularized generalized linear models.
 ROCR Visualizing the performance of scoring classifiers.
 gbm Generalized Boosted Regression Models.
 party A Laboratory for Recursive Partitioning.
 arules Mining Association Rules and Frequent Itemsets.
 tree Classification and regression trees.
 klaR Classification and visualization.
 RWeka R/Weka interface.
 ipred Improved Predictors.
 lars Least Angle Regression, Lasso and Forward Stagewise.
 earth Multivariate Adaptive Regression Spline Models.
 CORElearn Classification, regression, feature evaluation and ordinal evaluation.
 mboost Model-Based Boosting.
 Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Theano features:
 tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
 transparent use of a GPU – Perform data-intensive calculations up to 140x faster than on a CPU (float32 only).
 efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
 speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
 dynamic C code generation – Evaluate expressions faster.
 extensive unit-testing and self-verification – Detect and diagnose many types of errors.
 Theano has been powering large-scale computationally intensive scientific investigations since 2007.
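The log(1+x) stability feature in the list above is easy to see in plain Python. The sketch below uses only the standard library, not Theano: `math.log1p` computes the same stable form that the feature list says Theano substitutes for you. The value of `x` is an arbitrary choice for illustration.

```python
import math

x = 1e-18  # far smaller than double-precision machine epsilon (about 2.2e-16)

# Naive form: 1.0 + x rounds to exactly 1.0, so the logarithm comes out as 0.
naive = math.log(1.0 + x)

# Stable form: log1p never forms 1 + x explicitly, so the tiny value survives.
stable = math.log1p(x)

print("naive :", naive)
print("stable:", stable)
```

For very small x, log(1+x) is approximately x, so the stable result is the one that agrees with the mathematics; the naive result has lost all of its significant digits.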
 Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.
 It is free software licensed under the GNU General Public License.
 Contains a collection of visualization tools and algorithms for data analysis and predictive modeling.
 Weka’s main user interface is the Explorer.
 Training models on large datasets is not possible in the Weka Explorer graphical user interface.
 For large datasets, use the command-line interface (CLI) or write Java/Groovy/Jython code.
 Supports some streaming.
Commercial
 Provides visualization tools and wizards to create machine learning models.
 Easy to obtain predictions for the built model using simple APIs.
 Used by internal data scientist community.
 Highly scalable; supports real-time processing at high throughput.
 Cloud based. Pay as you go.
 IBM data analysis solution in the cloud.
 Automated visualization.
 Professional version: 10 million rows, 500 columns, 100 GB storage.
 Connects to social media data.
 Supports free-form text questions about data (similar to a Google search box).
 Supports easy and secure collaboration.
 Cloud ready analytics and visualization architecture from the leading analytics software company.
 Can be onsite as well.
 Supports the following SAS products:
 SAS Visual Analytics
 SAS Visual Statistics
 SAS Visual Investigator (Search)
 SAS Data Mining and Machine Learning
 Supports Python, Lua, Java, and REST APIs.
 Available third quarter of 2016