Table of contents

  1. Exploratory Data Analysis in Python

  2. Building a binary classification model using GBM and the H2O library

  3. Grid Search in H2O

  4. Data connections & Pipelines in Python - Hive2Python and Serial connection

  5. Basic web scraping using the BeautifulSoup library

  6. Topic modelling using LDA

  7. N-gram analysis using NLTK

Exploratory Data Analysis in Python

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations

To share my understanding of the concepts and techniques I know, I'll take the example of the famous Titanic dataset

Thanks to Kaggle and Encyclopedia Titanica for the dataset
import pandas as pd

df_input = pd.read_csv("titanic.csv",sep=",")
df_input.head(4)
Survived Pclass Name Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare
0 0 3 Mr. Owen Harris Braund male 22.0 1 0 7.2500
1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... female 38.0 1 0 71.2833
2 1 3 Miss. Laina Heikkinen female 26.0 0 0 7.9250
3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle female 35.0 1 0 53.1000

In this dataset, the column “Survived” is the binary target variable, which tells whether a particular passenger survived or not
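Before going further, a quick optional check of how balanced the target is; a minimal sketch using pandas value_counts:

## Quick check of the class balance for the target variable
print(df_input["Survived"].value_counts())
## the "zeros" column of the univariate summary further below shows 545 non-survivors out of 887 passengers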

## One can get a lot of insights for numeric fields from the describe function of pandas

df_input.describe()
Survived Pclass Age Siblings/Spouses Aboard Parents/Children Aboard Fare
count 887.000000 887.000000 887.000000 887.000000 887.000000 887.00000
mean 0.385569 2.305524 29.471443 0.525366 0.383315 32.30542
std 0.487004 0.836662 14.121908 1.104669 0.807466 49.78204
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.00000
25% 0.000000 2.000000 20.250000 0.000000 0.000000 7.92500
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.45420
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.13750
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.32920
## Let's check the number of unique values for each column

print(df_input.nunique())
Survived                     2
Pclass                       3
Name                       887
Sex                          2
Age                         89
Siblings/Spouses Aboard      7
Parents/Children Aboard      7
Fare                       248
dtype: int64
print(df_input.dtypes)
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
### You must have noticed that the unique value count for the column Pclass is only 3, even though it is a numeric field.
## Let's convert it into an object dtype so as to treat it as categorical

df_input["Pclass"] = df_input["Pclass"].astype("O") 
print(df_input.dtypes)
Survived                     int64
Pclass                      object
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
def univariate(data):
    #Creating separate dataframes for categorical and continuous variables
    data_cat = data.select_dtypes(include=[object])    
    data_cont = data.select_dtypes(exclude=[object])


    data_cont_univ = data_cont.describe(percentiles=[.001,.005,.01,.25,.5,.75,.95,.99,.995,.999]).transpose()

    data_cont_univ["blanks"] = data_cont.isna().sum(axis=0) + data_cont.isnull().sum(axis=0)

    data_cont_univ["zeros"] = (data_cont == 0).sum(axis=0)

    data_cat_univ = data_cat.describe().transpose()

    data_cat_univ["blanks"] = data_cat.isna().sum(axis=0) + data_cat.isnull().sum(axis=0)

    data_cat_univ["zeros"] = data_cat.isin(["0","00","000","0000","00000","000000"]).sum(axis=0)

    return data_cat_univ,data_cont_univ
data_cat_univ,data_cont_univ = univariate(df_input)
data_cat_univ.head()
count unique top freq blanks zeros
Pclass 887 3 3 487 0 0
Name 887 887 Mr. Sarkis Lahoud 1 0 0
Sex 887 2 male 573 0 0
data_cont_univ.head()
count mean std min 0.1% 0.5% 1% 25% 50% 75% 95% 99% 99.5% 99.9% max blanks zeros
Survived 887.0 0.385569 0.487004 0.00 0.0000 0.00 0.0 0.000 0.0000 1.0000 1.00000 1.000000 1.000 1.0000 1.0000 0 545
Age 887.0 29.471443 14.121908 0.42 0.6415 0.83 1.0 20.250 28.0000 38.0000 55.85000 66.000000 70.285 74.6840 80.0000 0 0
Siblings/Spouses Aboard 887.0 0.525366 1.104669 0.00 0.0000 0.00 0.0 0.000 0.0000 1.0000 3.00000 5.000000 8.000 8.0000 8.0000 0 604
Parents/Children Aboard 887.0 0.383315 0.807466 0.00 0.0000 0.00 0.0 0.000 0.0000 0.0000 2.00000 4.000000 5.000 5.1140 6.0000 0 674
Fare 887.0 32.305420 49.782040 0.00 0.0000 0.00 0.0 7.925 14.4542 31.1375 112.55749 249.600388 263.000 512.3292 512.3292 0 15
## Let's do the bivariate analysis now

def bivariate(data,ignore_col_list,event):
    
    #Subsetting categorical variables for bivariate analysis
    col_list_tmp = list(data.select_dtypes(include=[object]).columns)+[event]  # use the "data" argument, not the global df_input
    col_list = [w for w in col_list_tmp if w not in ignore_col_list]
    data_cat = data[col_list]
    
    cols = list(data_cat.columns.values)
    
    #Looping bivariate analysis over all variables and appending the results into a single dataframe
    #The len()-1 excludes the event column itself from the loop (as per the column sequence)
    appended_data = pd.DataFrame()
    for x in range(0, len(data_cat.columns)-1):
    
        data2 = pd.DataFrame({'1.Variable':data_cat.columns[x],
                              '2.Level':data_cat.groupby(data_cat.columns[x])[event].sum().index,
                              '3.Event':data_cat.groupby(data_cat.columns[x])[event].sum(),
                              '4.Volume':data_cat.groupby(data_cat.columns[x])[event].count(),
                              '5.Rate':((data_cat.groupby(data_cat.columns[x])[event].sum()/data_cat.groupby(data_cat.columns[x])[event].count())*100).round(2)})
    

        appended_data = pd.concat([appended_data, data2])  # DataFrame.append is deprecated/removed in newer pandas
    return appended_data
df_bivariate = bivariate(df_input,['Name'],"Survived")
df_bivariate.head()
1.Variable 2.Level 3.Event 4.Volume 5.Rate
1 Pclass 1 136 216 62.96
2 Pclass 2 87 184 47.28
3 Pclass 3 119 487 24.44
female Sex female 233 314 74.20
male Sex male 109 573 19.02

We could now build beautiful plots for these results; in fact, users can go ahead and change the body of our bivariate function to create Python plots

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# if using a Jupyter notebook, include:
%matplotlib inline
### Plotting a grid of plots for bivariate relationships between different columns in the data.

#Caution : Might be very slow depending upon the data size

sns.pairplot(df_input, hue="Survived")

Refer to the Seaborn documentation for more details on pairplot.

Using the above plots, we can see how closely two columns interact with one another

At the same time, the hue based on our dependent variable helps us identify potential features for predicting it
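Coming back to the earlier suggestion of plotting the bivariate results, here is a minimal sketch (using the df_bivariate frame built above) with seaborn's barplot; the particular plot choice is just one option:

## Bar plot of the event (survival) rate by level for each categorical variable
plot_df = df_bivariate.copy()
plot_df["2.Level"] = plot_df["2.Level"].astype(str)   # levels mix ints (Pclass) and strings (Sex)
ax = sns.barplot(data=plot_df, x="2.Level", y="5.Rate", hue="1.Variable")
ax.set_ylabel("Survival rate (%)")
plt.show()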

Building a Binary Classification Model using GBM in the H2O Library

Key Terms used in this section

Gradient Boosting Machine (GBM)

Gradient Boosting Machine (for Regression and Classification) is a forward learning ensemble method. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations.

H2O Library

H2O is a fully open source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient boosted machines, generalized linear models, deep learning and more.

In a grid search, users specify a set of values for each hyperparameter that they want to search over, and H2O will train a model for every combination of the hyperparameter values. This means that if you have three hyperparameters and you specify 5, 10 and 2 values for them respectively, your grid will contain a total of 5 × 10 × 2 = 100 models

Let's take the Titanic dataset and try to predict the “Survived” attribute using the H2O GBM technique

import h2o
import os
import pandas as pd
import numpy as np
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
## Shut down any cluster left over from a previous session, then start (or connect to) a fresh one
try:
    h2o.shutdown()
except Exception:
    pass

h2o.init()

train_df = h2o.import_file('titanic.csv')
Parse progress: |█████████████████████████████████████████████████████████| 100%
## Pass the target variable to be predicted
y = 'Survived'

### Pass the list of predictors 
x = ['Pclass', 'Age', 'Siblings/Spouses Aboard', 'Parents/Children Aboard', 'Fare']

# If target variable is binary, convert it to factor
train_df[y]= train_df[y].asfactor()
## Split the dataset into train and valid frames
train, valid = train_df.split_frame(ratios=[.8], seed=1234)
### Let's define our grid parameters. It is the same as running a for loop to generate models from the same ML algorithm but with different hyperparameters

gbm_params2 = {'learn_rate': [i * 0.01 for i in range(4, 5)],
                'max_depth': list(range(11, 12)),
                'sample_rate': [i * 0.1 for i in range(7, 9)],
                'col_sample_rate': [i * 0.1 for i in range(8, 9)],
               'col_sample_rate_per_tree': [i * 0.1 for i in range(7, 8)],
              'learn_rate_annealing':[0.99]
              }

# Search criteria
search_criteria = {'strategy': 'RandomDiscrete', 'max_models': 20, 'seed': 1}

# Train and validate a random grid of GBMs
gbm_grid2 = H2OGridSearch(model=H2OGradientBoostingEstimator,
                          grid_id='gbm_grid',
                          hyper_params=gbm_params2,
                          search_criteria=search_criteria)
gbm_grid2.train(x=x, y=y,
                training_frame=train,
                validation_frame=valid,
                ntrees=120,
                seed=1, ignore_const_cols=True)

Exporting the results to a csv file

ModelSummary = pd.DataFrame(columns=['Model_ID','learn_rate', 'Trees', 'Depth', 'Row_Sampling', 'Col_Sampling', 'MTries','NBins', 
                                    'Categorical_Encoding', 'Hist_Type', 'Training_Log_Loss', 'Validation_Log_Loss', 
                                    'Training_AUC', 'Validation_AUC'])

ModelSummary2 = pd.DataFrame(columns=['Model_ID', 'learn_rate','Trees', 'Depth', 'Row_Sampling', 'Col_Sampling', 'MTries','NBins', 
                                    'Categorical_Encoding', 'Hist_Type', 'Training_Log_Loss', 'Validation_Log_Loss', 
                                    'Training_AUC', 'Validation_AUC'])
for Model in gbm_grid2:
    ModelSummary['Model_ID'] = [Model.model_id]
    ModelSummary['learn_rate'] = [Model.actual_params.get('learn_rate')]
    ModelSummary['Trees'] = [Model.actual_params.get('ntrees')]
    ModelSummary['Depth'] = [Model.actual_params.get('max_depth')]
    ModelSummary['Row_Sampling'] = [Model.actual_params.get('sample_rate')]
    ModelSummary['Col_Sampling'] = [Model.actual_params.get('col_sample_rate_per_tree')]
    ModelSummary['MTries'] = [Model.actual_params.get('mtries')]
    ModelSummary['NBins'] = [Model.actual_params.get('nbins')]
    ModelSummary['Categorical_Encoding'] = [Model.actual_params.get('categorical_encoding')]
    ModelSummary['Hist_Type'] = [Model.actual_params.get('histogram_type')]
    ModelSummary['Training_Log_Loss'] = [Model.logloss()]
    ModelSummary['Validation_Log_Loss'] = [Model.logloss(valid=True)]
    ModelSummary['Training_AUC'] = Model.auc()
    ModelSummary['Validation_AUC'] = Model.auc(valid=True)
    ModelSummary2 = pd.concat([ModelSummary2, ModelSummary], sort=False)  # DataFrame.append is deprecated/removed in newer pandas

ModelSummary2.to_csv("gbm_grid_results.csv")
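As a follow-up, a minimal sketch (assuming the grid above has finished training) of pulling out the top model sorted by AUC and saving it to disk; the save path is a placeholder:

## Sort the grid by AUC (computed on the validation frame when one is supplied) and pick the top model
sorted_grid = gbm_grid2.get_grid(sort_by='auc', decreasing=True)
best_gbm = sorted_grid.models[0]
print(best_gbm.auc(valid=True))

## Persist the winning model to disk (path is a placeholder)
model_path = h2o.save_model(model=best_gbm, path="<model output dir>", force=True)
print(model_path)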

Building data connections between different servers

In building machine learning applications, the data sometimes resides on more than one server or disk. Let's say we have a client server from which we intend to fetch data into our own server for processing, and to which we push back the results.

In the above scenario, the following steps could be helpful

  1. Sqoop import the data from the client server and store its metadata on your server.
  2. Perform data processing and analysis
  3. Sqoop export the output data back to the client server

Let’s see how we can do this in Python
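Steps 1 and 3 are usually plain Sqoop command-line jobs; one simple way to trigger them from Python is via subprocess. A minimal sketch, assuming Sqoop is available on the box, with every connection detail left as a placeholder:

import subprocess

## Sqoop import: pull a table from the client server into HDFS (all values are placeholders)
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://<client-server>/<db>",
    "--username", "<id>", "--password", "<password>",
    "--table", "<tablename>",
    "--target-dir", "/user/<id>/<tablename>",
    "-m", "1"
]
subprocess.run(sqoop_import, check=True)

## Sqoop export: push the processed results back to the client server (again, placeholders)
sqoop_export = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://<client-server>/<db>",
    "--username", "<id>", "--password", "<password>",
    "--table", "<output_tablename>",
    "--export-dir", "/user/<id>/<output_tablename>"
]
subprocess.run(sqoop_export, check=True)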

Hive and Python

################# Hive connection in Python using pyhive package 

## In Python, we have a library called pyhive; it can be used to access a Hive table residing in an HDFS location
import pandas as pd
from pyhive import hive
import time
statement = '''SELECT * from dbo.tablename limit 10'''


print("Setting up connection")

### In the connect function, we will establish the connection between the hive database and our Python Kernel
conn = hive.connect(host='<server name>',\
                    port=<port number>,\
                    auth='LDAP',\
                    username='<id>',\
                    password="<password>")

print("Connection successful")
start1=time.time()
start=time.ctime()
print( "started at:"),start

cur = conn.cursor()
#### The below command will fetch the results for the query named "statement"
cur.execute(statement)
df = cur.fetchall()
### The output of fetchall is a list and doesn't contain the column names, so we will have to pick up the schema separately

### The below command will pull the column names as a list
cur.execute("show columns in dbo.tablename")
df1 = cur.fetchall()


col_names=[w[0] for w in df1]
print( "Data fetch completed at:"),time.ctime()
print(  "Time taken: "),time.time()-start1


print("output saved as a pandas df")

df_final=pd.DataFrame(df,columns=col_names)
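As an aside, pandas can do the fetch and the column naming in one step over the same connection. A minimal sketch; note that pandas officially recommends SQLAlchemy connectables for read_sql, so treat the raw DBAPI connection here as a convenience:

## One-step alternative: let pandas run the query and build the dataframe directly
df_final = pd.read_sql(statement, conn)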

SSH Connection in Python

Using Paramiko to establish SSH connection in Python

################ Serial connection in Python 

Sometimes we intend to establish a serial data connection capable of transmitting a data stream bit by bit, like this: 100000000000100000000010000

### Loading the paramiko package for establishing the SSH connection

import paramiko

print("Initiate a serial SSH client handle")

ssh_client =paramiko.SSHClient()

### The below command is optional; it tells paramiko to automatically accept unknown host keys
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

### The below command actually creates the connection to the remote UNIX server

ssh_client.connect(hostname="<server location>",username="<id>",password="<passwordhere>")

handle = ssh_client.open_sftp()

######### Reading file from HDFS location into python session

file_path =handle.file("<remote server location to be read>")

data_read= pd.read_csv(file_path)

### Write the file to local location

data_read.to_csv("testinfile.csv")

#### Writing back file from domino location to HDFS location 

handle.put("testinfile.csv","<remote server location where we want to write back>")

Note that the SSH connection is very slow, since it is a serial connection and transmits data bit by bit.

So, please use it for data transmission only for small datasets.

For big datasets, use a JDBC or ODBC connection in Python or R
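As one example, a minimal ODBC sketch using the pyodbc package; the DSN and credentials below are placeholders and depend on how the driver is configured on your machine:

import pyodbc
import pandas as pd

## Connect through an ODBC DSN configured for the database (placeholder credentials)
conn = pyodbc.connect("DSN=<dsn name>;UID=<id>;PWD=<password>")

## Bulk reads over ODBC are much faster than pulling files over SFTP
## (pandas prefers SQLAlchemy connectables for read_sql, but a DBAPI connection works as a convenience)
df_big = pd.read_sql("SELECT * FROM <tablename>", conn)
conn.close()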

Web Scraping in Python

Here we will cover different libraries in Python which can be used for scraping data from public websites

Put on your hacking cap and let's break the myths and jazziness around web scraping

Important notes before scraping data from any website

  1. Please check in advance whether it is legal to scrape data from that website by inspecting a few compliance-related aspects in its robots.txt (see the sketch after this list)
  2. Inspect the structure of the website and analyse what its flow and schema look like
  3. Always improvise while fetching the desired data.
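For point 1, the standard library can read robots.txt directly. A minimal sketch, using the same example site that appears below:

## Minimal robots.txt check using the standard library
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://web.mta.info/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://web.mta.info/developers/turnstile.html'))   # True means robots.txt allows fetching this page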
### The requests library is used for fetching the data from an HTML page
import requests
import urllib.request
import time
## BeautifulSoup is a very cool library; you can also check out the scrapy library (highly automated and parameterized)
## to study more about this lib, refer to https://www.crummy.com/software/BeautifulSoup/bs4/doc/
from bs4 import BeautifulSoup

## Target URL
url = 'http://web.mta.info/developers/turnstile.html'

response = requests.get(url)

## if the previous command was successful, then

soup = BeautifulSoup(response.text, 'html.parser')

## in scraping, the whole intention is to pull tags satisfying either a string or a regex
## here I am pulling all anchor ("a") tags, since they hold the hyperlinks on the page

all_tags = soup.findAll('a')

##This code gives us every element that has an <a> tag, in the form of a "list"

# now let's try to actually reach out to some underlying url via the "href" attribute of an "a" tag

single_tag = all_tags[12]
link = single_tag["href"]

download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])


Note that a request could take anywhere from a few milliseconds to a few seconds, so when you run a loop, do keep sleep commands in the code
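For instance, a minimal sketch of a polite download loop over the turnstile links found above, pausing between requests:

## Loop over the anchor tags, keep only turnstile data links, and pause between downloads
for single_tag in all_tags:
    link = single_tag.get("href")
    if link is None or '/turnstile_' not in link:
        continue                                  # skip anchors that are not turnstile files
    download_url = 'http://web.mta.info/developers/' + link
    urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])
    time.sleep(1)                                 # be polite: sleep between requests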

Topic modelling & N-Gram in Python

import pandas as pd
import re
import numpy as np
import matplotlib
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from gensim import corpora, models
from nltk.stem.wordnet import WordNetLemmatizer
import string
from nltk.stem import PorterStemmer
from nltk import word_tokenize

LDA For Topic Modelling

LDA is an iterative algorithm. Here are the two main steps:

  1. In the Initialization stage, each word is assigned to a random topic.

  2. Iteratively, the algorithm goes through each word and reassigns it to a topic, taking into consideration

    a. What’s the probability of the word belonging to a topic

    b. What's the probability of the document being generated by that topic

########## Data cleaning for Topic modelling 
def data_cleaning(tweet,custom_list):
    tweet = re.sub(r'\$\w*','',tweet) # Remove tickers
    tweet = re.sub(r'https?:\/\/.*\/\w*','',tweet) # Remove hyperlinks
    tweet = re.sub(r'['+string.punctuation+']+', ' ',tweet) # Remove punctuation like 's
    #stop = set(stopwords.words('english'))
    tweet = re.sub(r'[^a-zA-Z0-9]'," ",tweet)
    stop_words = set(['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'as', 'at',
                      'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', 'd',
                      'did', 'didn', 'do', 'does', 'doesn', 'doing', 'don', 'down', 'during', 'each', 'few', 'for', 'from', 'further',
                      'had', 'hadn', 'has', 'hasn', 'have', 'haven', 'having', 'he', 'her', 'here', 'hers', 'herself', 'him',
                      'himself', 'his', 'how', 'i', 'if', 'in', 'into', 'is', 'isn', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
                      'me', 'mightn', 'more', 'most', 'mustn', 'my', 'myself', 'needn', 'now', 'o', 'of', 'off', 'on', 'once',
                      'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', 'she',
                      'should', 'so', 'some', 'such', 't', 'than', 'that', 'the', 'their', 'theirs', 'them', 'themselves', 'then',
                      'there', 'these', 'they', 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 've', 'very',
                      'was', 'we', 'were', 'weren', 'what', 'when', 'where', 'which', 'while', 'who', 'whom', 'why', 'will',
                      'with', 'won', 'y', 'you', 'your', 'yours', 'yourself', 'yourselves'])
    exclude = set(string.punctuation)
    exclude1= set(custom_list)
    stop_words.update(exclude1)
    lemma = WordNetLemmatizer()
    stop_free = " ".join([i for i in tweet.lower().split() if i not in stop_words])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
########## Pass custom list of stop words in the following list "custom_list"
	
custom_list=["fall"]
####### Load input data for text analysis 
text1 = pd.read_csv(r'path.csv', sep=',', encoding='iso-8859-1')

text1.dropna(axis=0,how='any',inplace=True)

#### Here "Text" column is our target column 
text=text1["Text"].values.tolist()

####### mention number of topics 
NUM_TOPICS = 10
STOPWORDS = stopwords.words('english')

### text column was converted into a list for easy iterations 
tweet_text=text  # text is a list of column to be modelled
doc_complete=tweet_text
#doc_complete_2=[]
doc_complete_2=tweet_text


a = 0 
 
### For gensim we need to tokenize the data and filter out stopwords

tokenized_data = []
for w in doc_complete_2:
    a = a + 1
    # data_cleaning returns a cleaned string, so split it into a list of tokens for gensim
    tokenized_data.append(data_cleaning(w, custom_list).split())

# Build a Dictionary - association word to numeric id
dictionary = corpora.Dictionary(tokenized_data)

# Transform the collection of texts to a numerical form

corpus = [dictionary.doc2bow(text) for text in tokenized_data] 
# Have a look at how the Nth document looks like: [(word_id, count), ...]
# [(12, 3), (14, 1), (21, 1), (25, 5), (30, 2), (31, 5), (33, 1), (42, 1), (43, 2),  ...
# Build the LDA model

### Let's create an LDA model object, specifying the corpus, the number of topics and the word id dictionary

lda_model = models.LdaModel(corpus=corpus, num_topics=NUM_TOPICS, id2word=dictionary)
final=[]
for topic in lda_model.show_topics(num_topics=NUM_TOPICS, formatted=False, num_words=6):
    topicwords = [w for (w, val) in topic[1]]
    topicwords_val = [val for (w, val) in topic[1]]
    final.append([topicwords,topicwords_val])
final1=pd.DataFrame(final,columns=["topic","prob"])



final1.to_csv(path+"/topics.csv")  # "path" should point to your output directory
  • All of the algorithms take the number of topics (num_topics) as a parameter; none of them can infer the number of topics in the document collection on their own

  • All of the algorithms take the Document-Word Matrix (or Document-Term Matrix) as input, where DWM[i][j] = the number of occurrences of word_j in document_i

  • All of them output 2 matrices: WTM (Word Topic Matrix) and TDM (Topic Document Matrix). The matrices are significantly smaller, and the result of their multiplication should be as close as possible to the original DWM matrix.
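With gensim, both sides are easy to inspect; a quick sketch on the model trained above:

## Topic mix (TDM side) for the first document: a list of (topic_id, probability) pairs
print(lda_model.get_document_topics(corpus[0]))

## Word side (WTM) for a single topic: the top words and their weights
print(lda_model.show_topic(0, topn=6))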

N-Gram Analysis

########################################N-gram analysis

import re
import string
from nltk.corpus import stopwords
from nltk import word_tokenize
from collections import Counter
from nltk import ngrams

############## Creating UDF for ngram analysis , n means number of grams 1,2,3 ....


############ Data cleaning for Ngram analysis 
def clean_text(text,custom_list):
    tokenized_text = word_tokenize(text.lower())
    cleaned_text = [t for t in tokenized_text if t not in STOPWORDS and re.match('[a-zA-Z\-][a-zA-Z\-]{2,}', t)]
    cleaned_text1 = [t for t in cleaned_text if t not in custom_list]
    return cleaned_text1  # return the list with the custom stop words removed as well

def cal_ngram(text,n):
    token = word_tokenize(text)  # word_tokenize is imported from nltk above
    n_grams = ngrams(token,n)
    return n_grams

############# 
n = 3 #### Here n means number of grams

n_gram = cal_ngram(text[0],n)   # generating N grams for input data

### Counter library 
n_gram_common = Counter(n_gram).most_common()  # N-grams sorted by frequency; pass e.g. 10 to keep only the top 10

n_gram_df=pd.DataFrame(n_gram_common,columns=["N Gram","Frequency"])

#### Now these n-gram results can be used to publish a word cloud
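A minimal sketch of that word cloud, assuming the third-party wordcloud package is installed (it is not used anywhere above):

## Build a frequency dict from the n-gram counts and render a word cloud
from wordcloud import WordCloud

freq = {" ".join(gram): count for gram, count in n_gram_common}
wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(freq)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()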

End of the Document

Stay Tuned for more information