Machine Learning Tools

This is an incomplete list of the machine learning tools available as of July 2016. I have grouped them into open source tools and commercial tools. The split is not strict: the open source tools usually have a commercial version with support, and the commercial tools tend to include a free version that you can download and try. Click the product links to learn more.

Open Source

Spark MLlib

  • MLlib is Apache Spark’s scalable machine learning library.
    • Initial contribution from AMPLab, UC Berkeley
    • Shipped with Spark since version 0.8
    • Over 30 contributors
    • Includes many common machine learning and statistical algorithms
    • Supports Scala, Java and Python programming languages
  • Pros
    • Fast processing on Spark. (The project claims up to 100x faster than Hadoop MapReduce in memory and 10x faster on disk.)
    • Runs on Hadoop, Mesos, or standalone.
    • Easy to code. (with Scala)
  • Cons
    • Spark requires experienced engineers.
  • Online Resources http://spark.apache.org/mllib/
  • Algorithms

    –Basic statistics

    • summary statistics, correlation, sampling, hypothesis testing, and random data generation

    –Classification and regression

    • linear regression with L1, L2, and elastic-net regularization
    • logistic regression and linear support vector machine (SVM)
    • decision tree, naive Bayes, random forest, and gradient-boosted trees
    • isotonic regression

    –Collaborative filtering/recommendation

    • alternating least squares (ALS)

    –Clustering

    • k-means, bisecting k-means, Gaussian mixtures (GMM),
    • power iteration clustering, and latent Dirichlet allocation (LDA)

    –Dimensionality reduction

    • singular value decomposition (SVD) and QR decomposition
    • principal component analysis (PCA)

    –Frequent pattern mining

    • FP-growth, association rules, and PrefixSpan

    –Feature extraction and transformation

    –Optimization

    • limited-memory BFGS (L-BFGS)

Scikit-learn

Scikit-learn is a Python module for machine learning.

  • Built on NumPy, SciPy, and matplotlib
  • Open source, commercially usable – BSD license
  • Started in 2007 as a Google Summer of Code project

Git: https://github.com/scikit-learn/scikit-learn.git

  • Algorithms
    • classification: SVM, nearest neighbors, random forest, logistic regression
    • regression: support vector regression (SVR), ridge regression, Lasso
    • clustering: k-means, spectral clustering, …
    • decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …
    • model selection: grid search, cross validation, metrics
    • preprocessing: normalization, feature extraction
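
As a rough illustration of how the pieces in this list fit together, here is a minimal sketch that chains preprocessing, an SVM classifier, and a cross-validated grid search. The iris dataset, the parameter grid, and the 5-fold setting are my own choices for the example, not recommendations from the project, and the code assumes a scikit-learn release that ships `sklearn.model_selection`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small built-in dataset, chosen only to keep the example self-contained.
X, y = load_iris(return_X_y=True)

# Preprocessing and classification combined into a single estimator.
pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative grid; make_pipeline names the SVC step "svc".
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}

# Model selection: every grid point is scored with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the cross-validation score is not contaminated by the held-out data.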

 

H2O

  • H2O is open-source software for big-data analysis.
  • Built by H2O.ai, a Silicon Valley startup founded in 2011.
  • Users can throw models at data to find usable information, allowing H2O to discover patterns.
  • Provides data structures and methods suitable for big data.
  • Runs in the cloud, on Hadoop, and on all major operating systems.
  • Written in Java, with interfaces for Python and R.
  • Graphical interface works with all browsers.
  • Website: http://www.h2o.ai 

Pandas

  • pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Python is good for data munging and preparation; pandas adds data analysis and modeling.
  • Works well when combined with the IPython toolkit.
  • Good for linear and panel regression. Other models can be found in scikit-learn.

Google TensorFlow

  • Open source machine learning library developed by Google and used in many Google products, such as Google Translate, Maps, and Gmail.
  • Uses data flow graphs for numeric computation. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
  • Extensive built-in support for deep learning
  • It is only a library: it does not include the trained models or the algorithms behind Google's products.
  • Cloud offering – Google Cloud ML
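
To make the data flow graph idea concrete, here is a toy sketch in plain Python. It is not TensorFlow's API: the `Node` class and the `constant`, `elementwise`, `reduce_sum`, and `run` helpers are hypothetical names invented for this illustration, and plain lists stand in for tensors. It only shows the two-phase pattern described above, where you first build a graph of operations and then evaluate it.

```python
import operator


class Node:
    """One vertex of a toy data flow graph: an operation plus its input nodes."""

    def __init__(self, op, inputs=(), name=""):
        self.op = op                # callable that computes this node's value
        self.inputs = list(inputs)  # incoming edges: nodes whose values feed op
        self.name = name


def constant(value, name=""):
    # A source node with no inputs; it simply emits its value.
    return Node(lambda: value, name=name)


def elementwise(fn, a, b, name=""):
    # Applies fn pairwise to the values of two equal-length vector nodes.
    return Node(lambda x, y: [fn(p, q) for p, q in zip(x, y)], [a, b], name)


def reduce_sum(a, name=""):
    # Collapses a vector node to a scalar.
    return Node(lambda x: sum(x), [a], name)


def run(node, cache=None):
    """Evaluate a node by first evaluating every node it depends on."""
    if cache is None:
        cache = {}
    if id(node) not in cache:
        args = [run(dep, cache) for dep in node.inputs]
        cache[id(node)] = node.op(*args)
    return cache[id(node)]


# Phase 1: build the graph. Nothing is computed yet.
x = constant([1.0, 2.0, 3.0], "x")
w = constant([0.5, 0.5, 2.0], "w")
prod = elementwise(operator.mul, x, w, "x*w")
total = reduce_sum(prod, "sum")

# Phase 2: run it. Values flow along the edges from x and w to total.
result = run(total)
print(result)  # 0.5 + 1.0 + 6.0 = 7.5
```

Separating graph construction from execution is what lets a real library analyze the graph, compute derivatives, or place operations on different devices before any numbers are crunched.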

R

  • R is a free software environment for statistical computing and graphics.
  • Pros
    • Open source, and enterprise ready with RStudio.
    • Huge ecosystem with many libraries and packages.
    • Runs on all major operating systems and reads files in many formats.
  • Cons
    • Algorithm implementations vary between packages, so results can differ.
    • Memory management is weak; performance degrades as data grows.
  • Most used R ML Packages
    • e1071: Functions for latent class analysis, short time Fourier transform, fuzzy clustering, support vector machines, shortest path computation, bagged clustering, naive Bayes classifier.
    • rpart: Recursive Partitioning and Regression Trees.
    • igraph: A collection of network analysis tools.
    • nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models.
    • randomForest: Breiman and Cutler’s random forests for classification and regression.
    • caret: Classification And Regression Training.
    • glmnet: Lasso and elastic-net regularized generalized linear models.
    • ROCR: Visualizing the performance of scoring classifiers.
    • gbm: Generalized Boosted Regression Models.
    • party: A Laboratory for Recursive Partitioning.
    • arules: Mining Association Rules and Frequent Itemsets.
    • tree: Classification and regression trees.
    • klaR: Classification and visualization.
    • RWeka: R/Weka interface.
    • ipred: Improved Predictors.
    • lars: Least Angle Regression, Lasso and Forward Stagewise.
    • earth: Multivariate Adaptive Regression Spline Models.
    • CORElearn: Classification, regression, feature evaluation and ordinal evaluation.
    • mboost: Model-Based Boosting.

Theano

  • Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features:
    • tight integration with NumPy – Use numpy.ndarray in Theano-compiled functions.
    • transparent use of a GPU – Perform data-intensive calculations up to 140x faster than with a CPU (float32 only).
    • efficient symbolic differentiation – Theano does your derivatives for functions with one or many inputs.
    • speed and stability optimizations – Get the right answer for log(1+x) even when x is really tiny.
    • dynamic C code generation – Evaluate expressions faster.
    • extensive unit-testing and self-verification – Detect and diagnose many types of errors.
  • Theano has been powering large-scale computationally intensive scientific investigations since 2007.
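
The log(1+x) point in the list above is easy to reproduce without Theano. The sketch below uses only Python's standard `math` module; it is not Theano code, and the value of `x` is an arbitrary choice for illustration. It shows why the naive formula fails for tiny x and what a numerically stable replacement returns, which is the kind of rewrite the stability optimizations are described as handling for you.

```python
import math

x = 1e-20  # arbitrary tiny value, far below float64 precision relative to 1.0

# Naive form: 1 + x rounds to exactly 1.0 in float64, so the log is 0.0
# and all information about x is lost.
naive = math.log(1 + x)

# Stable form: log1p computes log(1 + x) without ever forming 1 + x.
stable = math.log1p(x)

print(naive)   # 0.0
print(stable)  # approximately 1e-20, since log(1 + x) is close to x for tiny x
```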

Weka

  • Waikato Environment for Knowledge Analysis (Weka) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand.
  • It is free software licensed under the GNU General Public License.
  • Contains a collection of visualization tools and algorithms for data analysis and predictive modeling.
  • Weka’s main user interface is the Explorer.
  • Training models on large datasets is impractical in the Explorer graphical user interface.
  • For large datasets, use the command-line interface (CLI) or write Java/Groovy/Jython code.
  • Supports some streaming.

Commercial

AWS Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions for the built model using simple APIs.
  • Used by Amazon's internal data scientist community.
  • Highly scalable; supports real-time predictions at high throughput.
  • Cloud based. Pay as you go.

Azure Machine Learning

  • Provides visualization tools and wizards to create machine learning models.
  • Easy to obtain predictions for the built model using simple APIs.
  • Used by internal data scientist community.
  • Highly scalable; supports real-time predictions at high throughput.
  • Cloud based. Pay as you go.

IBM Watson Analytics

  • IBM data analysis solution in the cloud.
  • Automated visualization.
  • Professional version: 10 million rows, 500 columns, 100 GB storage.
  • Connects to social media data.
  • Supports free-form text questions about data (like a Google search box).
  • Supports easy and secure collaboration.

SAS Viya

  • Cloud-ready analytics and visualization architecture from SAS, the analytics software company.
  • Can be deployed on-premises as well.
  • Supports the following SAS products:
    • SAS Visual Analytics
    • SAS Visual Statistics
    • SAS Visual Investigator (search)
    • SAS Data Mining and Machine Learning
  • Supports Python, Lua, Java, and REST APIs.
  • Available in the third quarter of 2016.

 
