Thursday, March 1, 2018

Machine Learning, Python, Pandas

Python and Data Science:

Object-oriented language
Dynamically typed language
Easy to learn
Suitable for data science and scientific analysis
Pandas - Python Data Analysis Library - lets you read and manipulate data efficiently
R is an alternative to Python for data science analysis
R has a lot of support for data manipulation

Pandas

Pandas is a library for data manipulation and analysis.
-Data science or data analytics is the process of analyzing large sets of data points to answer questions about that data set
-The library provides data structures and operations for manipulating numerical tables and time series.
-If the dataset has billions of records, traditional tools such as Excel will not work
-PyCharm Community Edition
-Jupyter notebook to work with the Pandas library
-df['Temperature'].max() - find the max temperature
-df['EST'][df['Events']=='Rain'] - days it rained
-df['WindSpeedMPH'].mean() - mean wind speed
-Data munging or data wrangling - the process of cleaning messy data
-Pandas comes with the Anaconda distribution, or can be installed with pip install pandas
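
The three one-liners above can be tried on a small in-memory table. This is only a sketch: the rows are made up for illustration, and the column spellings (EST, Temperature, WindSpeedMPH, Events) are assumed rather than taken from a real file.

```python
import pandas as pd

# Hypothetical weather rows, invented for illustration only.
df = pd.DataFrame({
    "EST": ["1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016"],
    "Temperature": [38, 36, 40, 25],
    "WindSpeedMPH": [8, 7, 5, 9],
    "Events": ["Sunny", "Rain", "Rain", "Snow"],
})

# Highest temperature in the table.
max_temp = df["Temperature"].max()

# Dates on which the event was "Rain" (boolean mask on one column).
rainy_days = df.loc[df["Events"] == "Rain", "EST"].tolist()

# Average wind speed over all rows.
mean_wind = df["WindSpeedMPH"].mean()

print(max_temp, rainy_days, mean_wind)
```

Using df.loc with a boolean mask is equivalent to the chained df['EST'][...] form in the notes and is usually the safer habit when you later assign values.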

Python Pandas Library - https://www.youtube.com/watch?v=F6kmIpWWEdU

Terms:

  • Mean - the sum of all values divided by the number of values
  • Median - the middle value of a sorted set of numbers
  • Standard deviation - a measure used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
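
A quick way to check these three terms is Python's built-in statistics module. The numbers below are made up for illustration, and the sketch uses the population standard deviation (pstdev), which divides by the number of values.

```python
import statistics

# Illustrative values only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)      # sum of values / number of values
median = statistics.median(values)  # middle of the sorted values
spread = statistics.pstdev(values)  # population standard deviation

print(mean, median, spread)  # 5 4.5 2.0
```

With an even number of values, the median is the average of the two middle values (here 4 and 5).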

Jupyter Notebook:

Install Anaconda; Jupyter is included with it.

Start Jupyter notebook:

>jupyter notebook - this starts the notebook server, e.g. at http://localhost:8888/notebooks/Pandas-First-Jupyter-notebook.ipynb#

>import pandas as pd
>df = pd.read_csv("/home/bhanu/github/MachineLearning-Pandas/Sample.csv")
>df['temperaturemin'].mean()  -- prints the mean of the column

"Fitting a linear regression" - Linear regression is a powerful and commonly used machine learning algorithm. It predicts the target variable using linear combinations of the predictor variables. Let's say we have two values, 3 and 4. A linear combination would be 3 * .5 + 4 * .5. A linear combination involves multiplying each number by a constant and adding the results.
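
The linear combination in that paragraph can be worked out in a couple of lines. This is only an illustration of the arithmetic; the names values and weights are chosen here, not taken from any library.

```python
# The two values from the text and the constant applied to each.
values = [3, 4]
weights = [0.5, 0.5]

# Multiply each value by its constant, then add the results.
combination = sum(v * w for v, w in zip(values, weights))

print(combination)  # 3.5
```

In linear regression the weights are not chosen by hand; they are the numbers the algorithm learns from the data.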

Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the predictors are correlated with the target, so linear regression should work well for us.

Example data:

date        temperaturemin  temperaturemax  precipitation  snowfall
2007-01-13  48              69.1            0              0
2007-01-19  34              54              0              0
2007-01-21  28              35.1            0.8            0
2007-01-25  30.9            46.9            0              0
2007-01-27  32              64              0              0
2007-02-05  19.9            39.9            0              0
2007-02-08  27              48              0              0
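
To experiment with these rows without a CSV file on disk, they can be pasted into a string and read with pandas. This is a sketch that assumes the columns are separated by whitespace, as they are shown here; the real Sample.csv may use a different delimiter.

```python
import io
import pandas as pd

# The example rows, pasted in as whitespace-separated text.
sample = """\
date temperaturemin temperaturemax precipitation snowfall
2007-01-13 48 69.1 0 0
2007-01-19 34 54 0 0
2007-01-21 28 35.1 0.8 0
2007-01-25 30.9 46.9 0 0
2007-01-27 32 64 0 0
2007-02-05 19.9 39.9 0 0
2007-02-08 27 48 0 0
"""

# sep=r"\s+" splits on any run of whitespace; parse_dates turns the
# date column into real timestamps instead of plain strings.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", parse_dates=["date"])

mean_min = df["temperaturemin"].mean()
hottest = df["temperaturemax"].max()
wet_days = df.loc[df["precipitation"] > 0, "date"].dt.strftime("%Y-%m-%d").tolist()

print(round(mean_min, 1), hottest, wet_days)
```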


Unsupervised learning algorithm:

k-means clustering is an unsupervised learning method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
The algorithm works as follows:

First we initialize k points, called means, randomly.
We categorize each item to its closest mean and we update the mean’s coordinates, which are the averages of the items categorized in that mean so far.
We repeat the process for a given number of iterations and at the end, we have our clusters.
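
The steps above can be sketched in plain Python. Note the assumptions: this is the common batch variant, which assigns every item first and then recomputes each mean, rather than updating a mean after each single item; the initial means are picked from the data points themselves; and the function name kmeans, its parameters, and the sample points are made up for illustration.

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Batch k-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial means.
    means = rng.sample(points, k)
    for _ in range(iterations):
        # Step 2a: assign each point to its closest mean
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, m)) for m in means]
            clusters[distances.index(min(distances))].append(p)
        # Step 2b: move each mean to the average of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old mean if a cluster ends up empty
                means[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    # Step 3: after the fixed number of iterations, return the result.
    return means, clusters

# Two visibly separate groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, clusters = kmeans(points, k=2)
print(sorted(means))
```

On data this well separated, the two means settle on the centers of the two groups within a few iterations; on real data the result can depend on the random starting points.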

Supervised learning algorithm:

Linear regression is a statistical method for finding the relationship between independent and dependent variables.
Linear regression uses the slope-intercept form of a line. The equation of a line in slope-intercept form is:

y=mx+b
'y' is our dependent variable.
'x' is our independent variable.
'm' is the slope. It measures how steep the line is.
'b' is the intercept. It tells where the line crosses the y-axis.
The idea of linear regression is to find the combination of slope (m) and intercept (b) that minimizes the SSE (sum of squared errors).
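
For a single predictor, the slope and intercept that minimize the SSE have a well-known closed form (ordinary least squares). The sketch below uses it on made-up points that lie exactly on y = 2x + 1; the helper names fit_line and sse are chosen here for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared errors for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b

def sse(xs, ys, m, b):
    """Sum of squared differences between actual and predicted y values."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

m, b = fit_line(xs, ys)
print(m, b)                # 2.0 1.0
print(sse(xs, ys, m, b))   # 0.0
print(sse(xs, ys, 1, 1))   # 55, a worse line has a larger SSE
```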

Github: https://github.com/gopularam/developer/tree/master/MachineLearning-Pandas

References:


  1. http://pandas.pydata.org
  2. https://www.dataquest.io/blog/machine-learning-python/ - best one; includes examples of supervised and unsupervised learning
  3. https://www.quora.com/How-would-linear-regression-be-described-and-explained-in-laymans-terms
  4. https://www.youtube.com/watch?v=CmorAWRsCAw
