This comprehensive guide on Machine Learning with Python discusses in detail how to use Python for ML. We start with the basics of Python and why it is worth using. We then give an overview of the mathematics and other things you need to learn before using Python for ML, and finish with a step-by-step tutorial.
Python has been gaining popularity among developers in recent times. It came onto the scene in the 1990s, and the language has been updated regularly ever since. Today, we will talk about machine learning with Python in detail.
The current version is the most powerful yet, while remaining simple to comprehend and work with. Let us see what makes Python important.
According to a SlashData report, 8.2 million developers use Python, while 7.6 million use Java. The numbers make it clear that Python's importance cannot be underestimated.
If you are a fresher, you need to pick a language that keeps you on track toward your career targets. You cannot just pick a language at random and master it.
If you are already in a job, going out of your way to learn a new programming language will give your career an added boost.
The prime example: if you are looking to get into the world of Machine Learning and Artificial Intelligence, learning Python is the best move you can make.
Some of the properties which make Python a must-learn language are mentioned below.
The idea behind using simple syntax is assigning more time to solve actual execution problems rather than spending time on debugging syntax errors.
Just think about the time you can save, which can be channeled into other productive areas.
Python is a programming language that gives you the freedom to work with syntax that is simple to write and easy to read.
For instance, Python does not require you to declare a variable's type before assigning to it. It also does not use curly braces or semicolons to structure code.
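As a small illustration of the syntax points above, here is a minimal sketch. The variable and function names (`scores`, `average`) are made up for this example.

```python
# No type declarations, curly braces, or semicolons are needed.
scores = [72, 88, 95]  # the type (a list of ints) is determined at runtime


def average(values):
    # Indentation, not braces, marks the body of the function.
    return sum(values) / len(values)


result = average(scores)
print(result)
```

The same logic in a statically typed, brace-delimited language would typically need explicit type declarations and more punctuation.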
First of all, Python is open-source, so you have some of the best developers in the community to help you in various situations. On top of that, regular workshops, conferences, and community meetups help a beginner feel at home.
There are plenty of resources online to learn about Python. Various blogs and YouTube channels regularly keep aspiring developers updated about recent changes and provide useful tips and suggestions.
Python comes with a rich set of frameworks and libraries, which make working on various projects easy. The standard library has modules that implement a wide range of functionality, so you do not have to code it manually.
Python works well on macOS, Windows, and Linux. It is also extremely portable, as it allows developers to run the same code on different operating systems.
In case you need full in-depth details about the Python library, you can visit the official website and access the documentation.
Mathematical skills are an indispensable part of machine learning development with Python. This does not mean that you need a master's degree in mathematics, but a sound, clear grasp of a few key topics will come in handy.
Now, let us see what role mathematics plays in Machine Learning. To put it simply, mathematics is the tool used to develop a model that analyzes data.
With the right mathematical approach, ML algorithms can be designed to extract the required information from large volumes of data fed into the system.
So, having a mathematical aptitude helps you handle data sets more efficiently, using algorithms that do a specific task precisely.
Linear Algebra
In linear algebra, you need to learn about basic elements like scalars and vectors. You will also have to learn about matrices.
You also need to know how to move between a system of linear equations and its matrix form. This will help in machine learning with Python.
Get the essential knowledge first. For instance, while working on regression, you need a firm grasp of matrix multiplication.
So, among all the different things you can learn about matrices, first acquaint yourself with multiplication. Do not miss out on basic operations like matrix addition either.
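The matrix operations mentioned above can be tried out directly in NumPy. This is a minimal sketch; the matrices and the pair of linear equations are invented for illustration.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[1.0, 0.0],
              [4.0, 2.0]])

matrix_sum = A + B      # element-wise matrix addition
matrix_product = A @ B  # matrix multiplication (rows of A times columns of B)

# The linear equations
#   2x +  y = 5
#    x + 3y = 10
# can be written in matrix form as A @ [x, y] = b and solved directly.
b = np.array([5.0, 10.0])
solution = np.linalg.solve(A, b)

print(matrix_sum)
print(matrix_product)
print(solution)
```

Writing the equations as `A @ [x, y] = b` is exactly the switch between linear equations and matrix form that regression relies on.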
Now, for working with PCA (Principal Component Analysis), you need to be familiar with the concept of eigenvectors.
Principal Component Analysis: It is a dimensionality reduction method. This means that with PCA, a large data set with 'n' variables is reduced to fewer variables in such a way that as much of the important information (the variance) as possible is kept.
Eigenvectors: For PCA, you need to understand eigenvectors. An eigenvector of a matrix is a vector whose direction does not change when the matrix is applied to it; it is only scaled, by a factor called the eigenvalue. In PCA, the eigenvectors of the data's covariance matrix give the directions along which the data varies the most, and these become the principal components.
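To make the link between eigenvectors and principal component analysis concrete, here is a minimal sketch in NumPy. The toy data is hypothetical, and real projects would normally use a library implementation such as scikit-learn's `PCA` rather than this hand-rolled version.

```python
import numpy as np

# Hypothetical toy data: 6 samples with 2 correlated features.
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

# Center the data and compute the covariance matrix of the features.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # sort from largest to smallest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project onto the first principal component: 2 features become 1.
first_component = eigenvectors[:, 0]
X_reduced = X_centered @ first_component

# Share of the total variance kept by the first component.
explained = eigenvalues[0] / eigenvalues.sum()
print(X_reduced.shape, round(explained, 3))
```

The eigenvector with the largest eigenvalue is the direction of greatest variance, which is why projecting onto it loses the least information.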
Multivariate Calculus
The second area of mathematical knowledge you need when working with Machine Learning models is differentiation.
Again, you don't need to go in-depth with calculus. For instance, in ML you will mainly be dealing with first-order derivatives, so higher-order derivatives are a lower priority.
Things you need to learn about derivatives, especially first-order ones, include the chain rule, sum rule, power rule, and the other basic differentiation rules. You also need to be familiar with partial differentiation.
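As a worked example of the chain rule, the sketch below differentiates a made-up function, f(x) = (3x + 2)^2, by hand and checks the result against a numerical approximation. The function and helper names are chosen only for this illustration.

```python
def f(x):
    # f(x) = (3x + 2)^2: an outer square applied to an inner linear function
    return (3 * x + 2) ** 2


def f_prime(x):
    # Chain rule: derivative of the outer part, 2 * (3x + 2),
    # times the derivative of the inner part, 3.
    return 2 * (3 * x + 2) * 3


def numerical_derivative(func, x, h=1e-6):
    # Central difference approximation of the first derivative.
    return (func(x + h) - func(x - h)) / (2 * h)


analytic = f_prime(1.0)
numeric = numerical_derivative(f, 1.0)
print(analytic, numeric)
```

Checking a hand-derived derivative against a finite difference like this is a common way to catch mistakes before using the derivative in an optimizer.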
But why do you need to study differentiation? The short answer is that differentiation is what lets us optimize Machine Learning models so that they fit the data better.
How does it work?
When you differentiate a function of several variables once, you get its vector of first-order partial derivatives, called the gradient; for a vector-valued function, the first derivatives are collected in the Jacobian matrix. The gradient points in the direction in which the function increases fastest, which helps locate the function's maximum and minimum points.
The Jacobian is naturally presented as a matrix. It also gives the best linear approximation of a nonlinear function near a given point.
When you differentiate again, you get the Hessian, the matrix of second-order partial derivatives. It describes the curvature of the function and can help gradient-based methods reduce the error in fewer steps. We look at Gradient Descent in more detail in the next section.
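The first and second derivatives described above can be approximated numerically. The following is a minimal sketch using central differences on a hypothetical two-variable function; the helper names `gradient` and `hessian` are defined here and are not library functions.

```python
import numpy as np


def f(p):
    # Hypothetical example function: f(x, y) = x^2 + x*y + 2*y^2
    x, y = p
    return x**2 + x * y + 2 * y**2


def gradient(func, p, h=1e-5):
    # Vector of first-order partial derivatives (central differences).
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad


def hessian(func, p, h=1e-4):
    # Matrix of second-order partial derivatives: differentiate the gradient again.
    p = np.asarray(p, dtype=float)
    n = p.size
    hess = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = h
        hess[:, i] = (gradient(func, p + step) - gradient(func, p - step)) / (2 * h)
    return hess


point = [1.0, 2.0]
grad_at_point = gradient(f, point)
hess_at_point = hessian(f, point)
print(grad_at_point)
print(hess_at_point)
```

For this function the exact gradient is (2x + y, x + 4y) and the exact Hessian is the constant matrix [[2, 1], [1, 4]], so the numerical results can be checked by hand.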
Gradient Descent
Gradient Descent is another important part of machine learning, closely tied to neural networks and derivatives. Without the right mathematical knowledge, it is tough to understand the idea of gradient descent.
Building a neural network from scratch is one of the best ways to get close to the mathematics involved.
There are various resources online that will help you learn how to build a neural network from scratch.
Here, you don't need to get into the depths of neural networks; a basic understanding, coupled with following guidelines and writing code, will suffice.
Gradient Descent is an iterative process that minimizes a cost function by moving step by step toward a local minimum. Another way to understand gradient descent is that it optimizes the weights of a model.
A plain example would be reaching an output of 0.3 from an input of 0.1. Here, we repeatedly adjust the weight until the output gets close to the target of 0.3.
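The example of reaching an output of 0.3 from an input of 0.1 can be sketched in a few lines. This is a minimal illustration with a single weight and a squared-error cost; the starting weight, learning rate, and number of steps are assumptions picked for the example.

```python
x, target = 0.1, 0.3  # input and desired output from the example above

w = 0.0               # assumed starting weight
learning_rate = 5.0   # assumed step size; works here because x is small

for step in range(200):
    prediction = w * x
    error = prediction - target
    # Cost is error^2, so by the chain rule d(cost)/dw = 2 * error * x.
    grad = 2 * error * x
    # Move the weight a small step against the gradient.
    w -= learning_rate * grad

final_prediction = w * x
print(w, final_prediction)
```

Each pass nudges the weight in the direction that lowers the cost, so the prediction moves steadily toward the target of 0.3.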
There are two ways to look at the importance of learning Python syntax for Machine Learning.
According to the first line of thought, you cannot avoid learning the syntax, and you cannot get the hang of it just by reading the documentation either.
You need a combination of study and practice to get a firm grasp of the syntax.
As per the second line of thought, you don't need deep syntactic knowledge of Python to work with machine learning.
A lot of the more complex elements can be skipped when learning Python syntax for machine learning applications. So, learning the basic syntax is the key.
The resources below are split into online and offline ones; the offline section mentions books that can help you in the process of learning Python.
There are various online learning sites that you can resort to for learning Python. They provide complete and comprehensive knowledge on the topic. Let’s see some online sites that provide resources for Python.
First on the list of online resources is Codecademy, one of the best online resources for learning a programming language. It is a suitable platform for both experts and rookies.
Dataquest is another important online source for learning Python. It has a unique way of presenting Python.
Dataquest brings data science and Python together: Python is taught in the context of data science.
None of the online resources can replace the official documentation on Python's website.
The tutorial videos available on the web are also a great way to learn different elements of Python, such as how the syntax works.
Learn Python the Hard Way is an excellent resource for learning Python and understanding its intricacies, moving up level by level.
The author, Zed Shaw, creates an atmosphere of active learning by guiding readers to other resources that they can dig into while reading the book.
As mentioned earlier, Python has numerous frameworks and libraries, which can make the task of working with data much simpler.
A library simply consists of objects and functions that can be imported into a script to save time and get better results.
Executing a task might sometimes require a large amount of code, but with the objects and functions present in a library, many lines of code can shrink to a single line.
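As a small illustration of how a library call can replace many lines, the sketch below computes a sample standard deviation first by hand and then with Python's built-in `statistics` module. The list of values is made up for this example.

```python
import statistics

values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0]

# Manual version: several lines to compute the sample standard deviation.
total = 0.0
for v in values:
    total += v
mean_manual = total / len(values)

squared_diffs = 0.0
for v in values:
    squared_diffs += (v - mean_manual) ** 2
stdev_manual = (squared_diffs / (len(values) - 1)) ** 0.5

# Library version: a single line.
stdev_library = statistics.stdev(values)

print(stdev_manual, stdev_library)
```

Both versions give the same number; the library version is shorter and less error-prone.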
Before heading into Python libraries, you must understand what Jupyter Notebook is and how you can use it. It will help you with Python Machine Learning.
So, let us first see what Jupyter Notebook is.
A Jupyter notebook is a web application that has three crucial components. First, you have space for text, which serves an explanatory purpose.
Next, you have space for the live execution of code. And finally, you can embed graphs and videos for better visualization.
Jupyter Notebook comes bundled with the Anaconda distribution of Python. With a plain Python installation, it can be installed separately using pip (pip install notebook).
Below are instructions on how you can use Jupyter Notebook to learn more about the libraries and use them in a better way.
Steps to leverage Jupyter Notebook to understand Python library:
We have compiled a list of open-source libraries based on the functionalities, which are divided into two parts: Data processing & Modeling, and Visualisation.
One of the benefits of Python is that it makes working with arrays easy and effective. NumPy, short for Numerical Python, takes this further by easing the process of working with arrays and matrices.
NumPy takes care of the different mathematical operations on arrays and makes those operations fast.
Further, NumPy brings several other functionalities to the fore, like numeric type conversion and various operations based on linear algebra.
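A few of the NumPy features mentioned above, shown as a minimal sketch. The sample measurements are hypothetical.

```python
import numpy as np

# Hypothetical petal lengths in centimeters.
petal_lengths = np.array([1.4, 1.3, 1.5, 4.7, 5.1])

# Mathematical operations apply to the whole array at once, without a loop.
lengths_in_mm = petal_lengths * 10
mean_length = petal_lengths.mean()

# Numeric conversion: turn an array of strings into integers.
as_integers = np.array(["1", "2", "3"]).astype(int)

# Linear algebra: dot product of two vectors.
weights = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
weighted_sum = np.dot(petal_lengths, weights)

print(lengths_in_mm, mean_length, as_integers, weighted_sum)
```

Operating on whole arrays like this is usually both shorter and faster than writing explicit Python loops.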
When you work with data in Python, you need to know pandas well. There is hardly anything in pandas that you can skip before moving forward.
pandas provides a plethora of tools for working with data. You can reshape data into almost any form, and you can add or remove data. Every change made to a data frame is reflected almost immediately.
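The following minimal sketch shows adding, removing, and filtering data in a pandas data frame. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements for three flowers.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
})

# Add a column computed from existing columns.
df["ratio"] = df["sepal_length"] / df["sepal_width"]

# Remove a column.
df = df.drop(columns=["sepal_width"])

# Keep only the rows that match a condition.
long_sepals = df[df["sepal_length"] > 5.0]

print(df)
print(long_sepals)
```

Each operation returns or updates the data frame right away, so the effect of a change can be inspected immediately with `print`.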
scikit-learn is one of the most popular Machine Learning libraries for Python. It was initially named scikits.learn and was later called just scikit-learn, where SciKit is short for 'SciPy Toolkit'.
Data scientists across the globe use scikit-learn for various machine learning and data mining tasks such as dimensionality reduction, model selection, regression, and clustering.
Matplotlib is a Python library that helps you understand data better through graphical representation.
It helps in the visual representation of large chunks of data as graphs or in some other visual form.
Common plots can be produced with little to no extra code.
However, if you want some advanced graphical representation, you need to be prepared to code a few extra lines.
By now you know why Python is a strong choice for Machine Learning. To use it in your own project, here is a step-by-step tutorial that you can follow to get hands-on experience.
The very first and obvious step to start working with Machine Learning in Python is to have Python and the SciPy libraries installed on your system. If you are a developer, you will know how to do this.
This step-by-step guide is written for Python 2.7 or 3.6+.
Begin by installing five crucial Python libraries: SciPy, NumPy, Matplotlib, pandas, and scikit-learn (imported as sklearn).
Installing the SciPy libraries is quite easy on platforms like macOS, Windows, and Linux, thanks to the clear instructions on the SciPy installation pages.
To put it simply, on macOS, MacPorts is an easy way to carry out the installation. On Linux, you can use your package manager. On Windows, you can begin by downloading Anaconda, which includes all of the libraries.
It is crucial that your Python version and environment are set up correctly. You can check this with the script below. Also, start by working on the Python command line instead of using an IDE.
Use the following script to test the versions.
# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Python: 3.6.9 (default, Oct 19 2019, 05:21:45)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.3.1
numpy: 1.17.3
matplotlib: 3.1.1
pandas: 0.25.1
sklearn: 0.21.3
Also, keep in mind the APIs do not change too frequently, so if you have a version from the recent past, it should work fine.
Here we will be using the 'Hello World' of Machine Learning datasets: the iris dataset.
The dataset has information about 150 iris flowers. It records four measurements for each flower (sepal length, sepal width, petal length, and petal width) along with the species the flower belongs to.
At this point, we import every function, object, and module we are going to use from the libraries. Everything should load without error; if something fails, go back and fix your environment before continuing.
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

The iris dataset is available from the UCI ML repository; here, we load it directly from a CSV copy hosted on GitHub. pandas is used to load the data, and later to explore and visualize it. Each column is named while loading the data, so that exploring it at later stages is easier.

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
Here, we work through the data and summarize it in a few different ways.
We will look at the data from different angles. Each view of the data takes just a single command.
These commands are worth remembering, as you will use them again and again in future projects.
Here we use the shape property to find the number of rows (instances) and columns (attributes).

# shape
print(dataset.shape)

For the iris dataset, you should see 150 rows (instances) and 5 columns (attributes).
(150, 5)
It is always a good idea to have a peek at the data to begin with.

# head
print(dataset.head(20))

This prints the first 20 rows of the dataset.

    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
Next, we look at a statistical summary of each attribute: the count, mean, standard deviation, minimum and maximum values, and some percentiles.

# descriptions
print(dataset.describe())

As you can see below, all of the numerical values are on the same scale (centimeters) and fall in a similar range, between 0 and 8 cm.

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Next, we look at the class distribution: the number of instances (rows) that belong to each class, as an absolute count.

# class distribution
print(dataset.groupby('class').size())

As you can see below, each class has the same number of instances, i.e. 50.

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50

The example below combines the steps above into a single script.

# summarize the data
from pandas import read_csv
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# shape
print(dataset.shape)
# head
print(dataset.head(20))
# descriptions
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())
After assessing the data in its raw format, it's time to give the data a visual representation for a better understanding of the dataset. We use two types of plots: univariate and multivariate.
Univariate plots show each attribute on its own.
Since the input variables are numeric, we can create a box and whisker plot of each.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

To get an idea of the distribution, we can also create a histogram of each input variable.

# histograms
dataset.hist()
pyplot.show()

Two of the input variables appear to have a roughly Gaussian distribution. This is useful to note, because some algorithms can exploit that assumption.
Multivariate plots show the interactions between the input variables. A scatter plot matrix helps in understanding the relationships between pairs of attributes.

# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()

In the resulting plot, note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
All the data visualization steps are brought together in a single reference script.

# visualize the data
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
# histograms
dataset.hist()
pyplot.show()
# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()
It is time to create some models of the data and estimate their accuracy on unseen data.
We first need a way to check that a model we create is actually good.
We follow a two-part approach. First, we use statistical methods to estimate the accuracy of the models on unseen data.
Second, we hold back some data that the algorithms never see, and use it to get an independent estimate of the accuracy of the best model.
Here, 80% of the dataset is used to train, evaluate, and select among our models. The remaining 20% is held back as a validation dataset.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
To estimate model accuracy, we use stratified 10-fold cross-validation. The training data is split into 10 parts; 9 parts are used for training and the remaining part is used for testing. This is repeated for every combination of train-test splits.
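To show how a 10-fold split works mechanically, here is a minimal sketch that generates the train and test indices by hand. Note the assumptions: this is plain (not stratified) k-fold, the helper `k_fold_indices` is hypothetical, and the sample count of 120 corresponds to 80% of the 150 iris rows. In the tutorial itself, scikit-learn's `StratifiedKFold` does this work and also keeps the class proportions equal in each fold.

```python
import numpy as np


def k_fold_indices(n_samples, n_splits=10, seed=1):
    # Shuffle the row indices once, with a fixed seed for repeatability.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    # Cut the shuffled indices into n_splits roughly equal parts (folds).
    folds = np.array_split(indices, n_splits)
    for i in range(n_splits):
        # One fold is held out for testing; the other folds are used for training.
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, test_idx


splits = list(k_fold_indices(120))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Every row ends up in the test fold exactly once, so the 10 accuracy scores together cover the whole training set.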
To make sure that every algorithm is evaluated on the same splits, we fix the random seed using the random_state argument.
The models are evaluated using the accuracy metric. It is the ratio of the number of correctly predicted instances to the total number of instances in the dataset, multiplied by 100 to give a percentage.
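The accuracy ratio described above is simple enough to compute by hand. The sketch below uses made-up labels and a hypothetical helper named `accuracy`; in the tutorial, scikit-learn's `accuracy_score` plays this role.

```python
def accuracy(y_true, y_pred):
    # Number of correct predictions divided by the total number of instances.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true labels and predictions: 4 of the 5 match.
y_true = ["setosa", "versicolor", "virginica", "versicolor", "setosa"]
y_pred = ["setosa", "versicolor", "versicolor", "versicolor", "setosa"]

score = accuracy(y_true, y_pred)
print("{:.1%}".format(score))
```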
Since we do not know in advance which algorithms will do well on this problem, or which configurations to use, we try a mix of simple linear and nonlinear algorithms.
Linear algorithms: Logistic Regression (LR) and Linear Discriminant Analysis (LDA). Nonlinear algorithms: K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naive Bayes (NB), and Support Vector Machines (SVM).

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))

We now have accuracy estimates for each of the six models, and we need to compare them. One of the best ways to compare the mean accuracy and spread of the models is to plot the results.
Each algorithm was evaluated 10 times (once per fold), so the sample of results for each can be drawn as a box and whisker plot and the distributions compared with ease. The raw output, giving the mean accuracy followed by the standard deviation in brackets, looks like this:

LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)

The complete example is given below for reference purposes.

# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
Now we pick an algorithm to use for making predictions.
In step 5, you might have observed that the SVM model has the highest estimated accuracy, so we will use this model.
Here, we check the accuracy of the model on the validation dataset that we kept aside earlier.
We fit the model on the entire training dataset and make predictions on the validation dataset.

# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

We evaluate the predictions by comparing them with the expected results in the validation set, and then calculate the classification accuracy along with a confusion matrix and a classification report.

# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

Below you can see that the accuracy is about 96.7% on the held-out validation set. The confusion matrix shows the single error that was made, and the classification report breaks each class down by precision, recall, f1-score, and support.
0.9666666666666667
[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.92      0.96        13
 Iris-virginica       0.86      1.00      0.92         6

       accuracy                           0.97        30
      macro avg       0.95      0.97      0.96        30
   weighted avg       0.97      0.97      0.97        30
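To see where the numbers in the classification report come from, the sketch below recomputes precision, recall, f1-score, and accuracy from the confusion matrix shown above. It relies on scikit-learn's convention that rows are the true classes and columns are the predicted classes; the variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = np.array([[11, 0, 0],
               [0, 12, 1],
               [0, 0, 6]])
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / everything predicted as the class
recall = true_positives / cm.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print("{:<16} {:.2f} {:.2f} {:.2f}".format(name, p, r, f))
print("accuracy {:.4f}".format(accuracy))
```

The one versicolor flower predicted as virginica lowers the recall of versicolor (12 of 13) and the precision of virginica (6 of 7), which matches the report.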
The complete example for reference is given below in a single script.
# make predictions
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
With this, we come to the end of this comprehensive guide on Machine Learning with Python. We have delved into the 'whys' and 'hows' of Machine Learning and Python.
We would like to end with a single point: you should not feel bogged down if you do not have prior programming knowledge.
Python's syntax is simple, and learning it depends very little on knowing other programming languages. So, you can start afresh without any apprehension.
We hope you had a great time reading this article and that it proves to be of great value. Thank you!