Libraries used in machine learning

Vishal Khandate
10 min read · Jan 5, 2022

Machine Learning is the science of programming computers so that they can learn from data, usually massive datasets. Recently, Machine Learning has been expanding rapidly as a field thanks to the wide availability of processing power, and machine learning algorithms are now routinely used to solve many kinds of real-world problems.

Machine learning, however, is not a new field. In earlier days, researchers built models by implementing every algorithm and every mathematical and statistical formula by hand. This tedious, long-winded process made machine learning algorithms a huge obstacle to implement.

Today, however, implementing machine learning and deep learning algorithms has become much easier and far more efficient, largely thanks to Python and the availability of numerous machine learning and data science libraries, frameworks and modules. Python has become one of the most popular programming languages for Machine Learning, Deep Learning and Data Science because of these libraries.

Some of these libraries that influenced the boom of Python in the Machine Learning sphere are:

· NumPy

· SciPy

· Scikit-learn

· Theano

· TensorFlow

· Pandas

· OpenCV

NumPy

NumPy stands for Numerical Python. It is a Python library that is mainly used for working with n-dimensional arrays in Python. Apart from array functions, it also provides various functions in the domain of linear algebra, along with Fourier transforms and matrix operations.

NumPy was created in 2005 by Travis Oliphant and is an open-source project. Anyone can use it freely, contribute to the code base or simply read the code, thanks to its open-source license.

Why Use NumPy?

Python does provide lists, which serve a similar purpose to arrays, but they are not time-efficient. NumPy aims to provide n-dimensional array objects that are up to 50x faster than traditional Python lists.

The main draw for NumPy is large arrays and the promise of much faster operations than with lists. The array object in NumPy is called ndarray, and it is provided with a lot of supporting functions that make working with it quite easy.

Arrays are used very frequently in data science, where speed and resources matter. Most datasets can be represented as 2-D arrays with rows and columns, so the need for fast array operations in a data-science-centered programming language is apparent.

NumPy is also widely known and well supported. In many situations, NumPy arrays and Python lists can be used interchangeably, although doing so is not recommended because of the performance cost.
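
As a quick illustration, here is a minimal sketch of creating an ndarray and applying vectorized operations to it (the values are arbitrary):

import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2-D array (ndarray)
print(a.shape)   # (2, 3)
print(a * 2)     # element-wise multiplication, no explicit Python loop
print(a.T @ a)   # matrix product using the @ operator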

SciPy

SciPy stands for Scientific Python. It is a free and open-source Python library that is mainly used for scientific computing and technical computing. SciPy contains various modules for purposes of linear algebra, optimization, signal and image processing, interpolation, integration, FFT, special functions, ODE solvers and other tasks and functions that are commonly required in science and engineering.

Why Use SciPy?

One point to note is that SciPy uses NumPy for its operations under the hood. The question inevitably arises: “If SciPy uses NumPy underneath, why not just use NumPy?”

The answer is that SciPy optimizes and extends the functions that are frequently used with NumPy in data science. It also offers a much more expansive roster of functions, with the bonus of NumPy’s processing power underneath.

There are a few caveats with SciPy, however, mainly that some of its routines run slower because the library is written in Python on top of NumPy. NumPy is mostly written in C, which gives it fast processing speeds, while SciPy adds the overhead of running its Python code on top.

Thus, if the programmer simply wants more functionality and is not primarily concerned with raw performance, SciPy is the ideal pick over NumPy.
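
For instance, here is a minimal sketch using SciPy’s optimization module to find the minimum of a simple function (the function itself is an arbitrary example):

from scipy import optimize
result = optimize.minimize_scalar(lambda x: (x - 2) ** 2)  # minimize f(x) = (x - 2)^2
print(result.x)  # approximately 2.0, the true minimum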

Scikit-learn

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free, open-source machine learning library for the Python programming language. It features various classification, regression and clustering algorithms such as support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.

Why Use Scikit-learn?

First of all, Scikit-learn provides amazing ease of use, offering various efficient tools for predictive analytics. Moreover, because the scikit-learn toolkit is free and open-source, it is easily accessible to everyone.

Scikit-learn is built on the NumPy, SciPy and Matplotlib libraries in Python, which gives it great performance benefits. And just like the Python programming language, it is open-source and commercially usable, so many programmers can contribute to the project.

Scikit-learn is often used for commercial purposes due to its power and ease of use. Examples of large companies that use scikit-learn in their operations are J.P. Morgan and Spotify. J.P. Morgan uses scikit-learn across many of the bank’s applications for classification, prediction and analysis. Spotify uses scikit-learn in the typical way a recommendation engine is used, generating music recommendations to provide a better user experience.

Scikit-learn Tutorial using Python

As the Scikit-learn library provides great ease of use for machine learning tasks, if you are working on applications involving classification, regression or clustering, most of the work can be implemented using this library. Let us take a short trip through building a machine learning model with Scikit-learn.

We generally start by splitting the dataset into training and test sets; here is how:

from sklearn.model_selection import train_test_split
# x holds the features and y the labels of the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

Next, we import the model that we want to use for classification, regression, etc., and initialize it with its constructor:

# import the desired model from the relevant sklearn module,
# e.g. a k-nearest-neighbours classifier:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()  # here the model is initialized

Next, we train the model object on the x and y training sets using the fit method:

model.fit(x_train,y_train)

Next, we predict the model’s output on the x test set:

y_predicted=model.predict(x_test)

Finally, if we want the accuracy of the model, we can use the metrics package to check the predictions against the true y values:

from sklearn import metrics
accuracy = metrics.accuracy_score(y_test, y_predicted)
# accuracy_score can be swapped for any suitable metric, such as f1_score
# or mean_squared_error; it is up to the developer to pick the metric
# that suits the problem

That’s it: we have built a machine learning model in just a few minutes and a few lines. This is how most machine learning models can be built using Scikit-learn.
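
Putting all the steps together, here is a minimal end-to-end sketch; the iris dataset and the k-nearest-neighbours classifier are illustrative choices, not the only options:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
x, y = load_iris(return_X_y=True)  # features and labels of a built-in dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
model = KNeighborsClassifier()
model.fit(x_train, y_train)
y_predicted = model.predict(x_test)
print(metrics.accuracy_score(y_test, y_predicted))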

Theano

Theano is a Python library and optimizing compiler that is mainly used for manipulating and evaluating mathematical expressions, especially matrix-valued ones. Theano compiles computations to run efficiently on either CPU or GPU architectures, which makes it well suited to matrix-based calculations.

Theano has been called the “grandfather of Python deep learning libraries”. It has been in development since 2007 and has influenced the design of later libraries such as TensorFlow.

Why use Theano?

Theano, as a Python library, allows us to evaluate operations involving multi-dimensional arrays extremely efficiently. Because it can use the GPU (which is essential for running larger neural networks), it is widely used for building Deep Learning projects.

Theano is also tightly integrated with NumPy, so it can make use of NumPy’s performance when it evaluates complex array expressions on the CPU instead of the GPU.

However, GPU computations in Theano support only float32 variables, which is a limitation.
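
To make the workflow concrete, here is a minimal sketch of Theano’s symbolic style: declare symbolic variables, build an expression, then compile it into a callable function:

import theano
import theano.tensor as T
x = T.dmatrix('x')              # symbolic matrix variables
y = T.dmatrix('y')
z = x + y                       # a symbolic expression, not yet evaluated
f = theano.function([x, y], z)  # compiled into an efficient callable
print(f([[1, 2]], [[3, 4]]))    # [[4. 6.]]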

Theano vs TensorFlow

TensorFlow is the most widely used and best-known framework for Deep Learning and is therefore used in many research projects. Theano can perform some tasks faster than TensorFlow, but in multi-GPU tasks TensorFlow takes the lead.

Even where it runs tasks faster than TensorFlow, Theano’s compile time is slower. Moreover, Theano is designed for computations of much lower complexity, whereas TensorFlow can handle massive calculations.

Theano, which has been in development since 2007, has largely stagnated; major development of the library ended in 2017. The popularity of TensorFlow, along with its ongoing development and updates, makes it a much better alternative to Theano.

Some more features

1. Automatic Differentiation – Theano automatically figures out how to compute gradients from the forward (prediction) part of the model, allowing us to perform gradient descent for model training (see the sketch after this list).

2. Speed and Stability Optimisations – Theano internally reorganises and optimises computations for faster and more stable execution. It also compiles some operations into C code for better computation speed.
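
As a small sketch of the automatic differentiation feature, Theano can derive the gradient of a symbolic expression for us:

import theano
import theano.tensor as T
x = T.dscalar('x')
y = x ** 2                    # the forward expression
gy = T.grad(y, x)             # Theano derives dy/dx = 2x symbolically
f = theano.function([x], gy)
print(f(4.0))                 # 8.0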

TensorFlow

TensorFlow is one of the best-known deep learning libraries as well as an essential platform for modern machine learning. It is an end-to-end open-source platform with a huge, flexible assortment of tools, libraries and community resources. It improves ease of use for data scientists and researchers by providing various well-performing models ready for use.

Why use TensorFlow?

TensorFlow uses data flow graphs to build models. It allows the creation of many types of large-scale neural networks with many layers. TensorFlow is mainly used for classification, perception, understanding, discovering and prediction.

TensorFlow lets users build and train Machine Learning and Deep Learning models easily with the help of intuitive APIs like Keras. It also supports eager execution, enabling immediate model iteration and easier debugging.

Because TensorFlow is built on a simple and flexible architecture, it allows researchers and data scientists to take new ideas from paper to code, to accurate, state-of-the-art models, and to publication faster.
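
As an illustration, here is a minimal sketch of defining and compiling a small fully connected classifier with the Keras API (the layer sizes and input shape are arbitrary):

import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()  # prints the architecture; model.fit(...) would train it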

Pandas

Pandas is a software library written for the Python programming language and is generally used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

Why Use Pandas?

Pandas is mainly used for data analysis, data correlation and data cleaning. We can import data from various file formats such as comma-separated values (CSV), JSON, SQL database tables or queries, and Microsoft Excel.

We can also use this library for cleaning the data we have read. It offers options to clear empty cells, cells with wrong formatting or wrong data, as well as the option to remove duplicate values for better processing.
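
Here is a minimal sketch of that workflow; ‘data.csv’ is a placeholder file name, not a real dataset:

import pandas as pd
df = pd.read_csv('data.csv')  # 'data.csv' is a hypothetical input file
df = df.drop_duplicates()     # remove duplicate rows
df = df.dropna()              # drop rows that contain empty cells
print(df.describe())          # summary statistics of the cleaned data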

OpenCV

OpenCV is a library for real-time computer vision and image processing. Originally developed by Intel in C/C++, it was later supported by Willow Garage and then Itseez. Today the library is cross-platform, available in multiple languages and free to use under the open-source Apache 2 License.

Why Use OpenCV?

OpenCV improves the ease of Image Processing, Pattern Recognition and various other Computer Vision operations in many languages, including Python. As mentioned, it is cross-platform.

OpenCV makes it easy to analyze objects in a 3-D scene captured through a camera. It also lets us perform image-to-image transformations through the many functions it provides. Hence OpenCV is useful for both Computer Vision and Image Processing tasks.

Used along with Machine Learning, OpenCV can help build very powerful tools for object recognition and object tagging. By itself, it also provides functions useful for building applications involving image encryption and decryption.
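
For example, here is a minimal sketch of a common image-to-image transformation: loading an image, converting it to grayscale and detecting edges (‘input.jpg’ is a placeholder path):

import cv2
image = cv2.imread('input.jpg')                 # 'input.jpg' is a hypothetical file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # convert the BGR image to grayscale
edges = cv2.Canny(gray, 100, 200)               # Canny edge detection with two thresholds
cv2.imwrite('edges.jpg', edges)                 # save the result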

Conclusion

Today Machine Learning has developed into a vast paradigm whose traces can be found almost everywhere. In everyday use, recommendation engines and face recognition software immediately come to mind; in the research and development sphere, the first things that come to mind are Python and its libraries.

Python has evolved from a fun, programmer-friendly tool once seen as impractical for performant use into one of the most reliable and widely used programming languages in machine learning, thanks to its vast array of libraries, frameworks and modules. This growth mirrors how Machine Learning itself has grown from a niche subject requiring endless man-hours to implement an algorithm into one of the most widely known and exciting fields of research today.

Contributors

Mihir Kamat
Swaraj Kalbande
Rohan Katta
Vishal Khandate
