
Data Mining & Analytics with R : Running R Scripts and Data Mining Techniques - Day 2

Warning: These are my messy study notes. Much more legible notes can be found here: http://onepager.togaware.com

1. A Tour Thru Rattle


Transform Tab (nowhere near the full power of the underlying R)
Data Mining Tabs
- Cluster, Associate, Model
Log Tab 
- Capture the corresponding R command
  • Working from Left to Right on Tabs 
  • Remember to Click Execute Button
  • 'Save' -> Projects save the current state, all models etc.
  • 'Open' Projects can be restored at a later time
  • You can even load it back to R 

2. First R Program

Load rattle and ggplot2

library(rattle)   # Provides the weather dataset
library(ggplot2)  # Provides the ggplot() function
ggplot -> Grammar of Graphics: just like English grammar or the grammar of a computer language. A result of Hadley Wickham's PhD. Look him up to learn more details.

Then produce a plot using ggplot()
# handsondatascience.com - tips on elegantly writing repeated code
ds <- weather 

1. ggplot(ds, aes(x=MaxTemp, y=MinTemp)) + geom_point()

aes - aesthetics (x axis, y axis, colours etc.)
geom_point - you want points as the geometric indicators

2. ggplot(ds, aes(x=MaxTemp, y=MinTemp)) + geom_point() + ggtitle("Daily Temp Obs")


3. ggplot(ds, aes(x=MaxTemp, y=MinTemp, colour=RainTomorrow)) + geom_point() + ggtitle("Daily Temp Obs")



GGWIZ

Google "R Gallery" to find lots of graph implementations

3. DMT : Clustering (Classification)


Cluster Analysis:
Grouping a collection of observations so that similar ones end up together.
Has been done for centuries: classifying people, animals, mammals etc.
I cannot understand each one of you scientifically without any historic background

Cluster - KMeans (Number of clusters :2)
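
What the Cluster tab does with KMeans can be approximated in plain R. This is only a minimal sketch, assuming base R's kmeans() and the built-in iris measurements as stand-in data (the dataset used in the session is not reproduced here). Rattle's Log tab shows the exact call it generates.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
# (an assumption for illustration, not the data used in the session).
num <- scale(iris[, 1:4])   # rescale so no single variable dominates the distances

set.seed(42)                # k-means starts from random centres
km <- kmeans(num, centers = 2, nstart = 10)

km$size                     # observations per cluster
km$centers                  # cluster centres, in scaled units
```
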


Cluster - KMeans (New csv file audit.csv)

The ideal number of clusters is 12. This is how to choose it
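
One common heuristic for picking the number of clusters (an assumption on my part; these notes do not record which method was shown) is to run k-means for a range of cluster counts and look for the point where the total within-cluster sum of squares stops dropping quickly. A rough sketch with base R's kmeans(), again using iris as stand-in data rather than audit.csv:

```r
num <- scale(iris[, 1:4])          # stand-in data, rescaled

set.seed(42)
ks  <- 1:10
wss <- sapply(ks, function(k) {
  kmeans(num, centers = k, nstart = 10)$tot.withinss
})

# Look for the "elbow": the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```
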


Google: Curse of dimensionality (use ewkm for clustering if you have a lot of variables)

If you want to do clustering on categoric values, e.g. male, female, use:
Transform -> Recode-> Indicator Variable
Transform -> Recode -> As Numeric
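
Roughly what these two Recode options amount to can be sketched in base R. The data frame below is made up, and Rattle's own generated code (see the Log tab) may well look different.

```r
# Hypothetical toy data with one categoric variable.
people <- data.frame(
  gender = factor(c("male", "female", "female", "male")),
  age    = c(34, 51, 27, 45)
)

# Indicator variables: one 0/1 column per level ("- 1" drops the intercept
# so that every level gets its own column).
ind <- model.matrix(~ gender - 1, data = people)

# As numeric: replace each level with its integer code (female = 1, male = 2).
# Only sensible when the levels have a meaningful order.
people$gender_num <- as.numeric(people$gender)

cbind(people, ind)
```
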

To measure the difference between two cars you need numeric values:

The number of pistons can serve as a numeric value
Or some parameters that indicate luxury
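
Once the cars are described by numbers, the "difference" is just a distance. A small sketch using the mtcars dataset that ships with R; the choice of variables and of the two cars is mine, with cyl (cylinders) standing in for the number of pistons.

```r
# mtcars ships with R; cyl (cylinders) stands in for "number of pistons".
vars   <- c("cyl", "hp", "wt")
scaled <- scale(mtcars[, vars])    # put the variables on a common scale

# Euclidean distance between two cars in the scaled space
two <- scaled[c("Toyota Corolla", "Cadillac Fleetwood"), ]
d   <- dist(two)
d
```
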



4. DMT : Association Rule Mining (Recommendation)


It's what Amazon did for suggesting  books. 
The beer and baby diaper example

Link analysis
Market basket analysis
Cross Marketing

Math n CS -> High Distinction
[91%, 75%] [support, confidence]

Gladiator n Patriot -> Sixth Sense
[0.1%, 90%]

Statins n Peritonitis -> Chronic Renal Failure
[0.1%, 32%]

Gladiator n Patriot -> Sixth Sense
[0.1% - support, 90% - confidence]
(http://onepager.togaware.com/ association analysis)
support ->
out of 1000 carts, 0.1% contain all 3 of those movies,
i.e. 1 cart has all of these in the shopping cart

confidence ->
of the people who have watched Gladiator and Patriot, 90% have also watched Sixth Sense.

lift ->
confidence / support of the right-hand side (here Sixth Sense on its own). A lift above 1 means the movies occur together more often than chance would suggest; the higher the lift the better
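
The three measures can be computed by hand on a toy example. The carts below are made up for illustration, and the helper support() is my own, not a function from a package.

```r
# Hypothetical shopping carts (made up for illustration).
carts <- list(
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot"),
  c("Sixth Sense"),
  c("Gladiator"),
  c("Patriot"),
  c("Gladiator"),
  c("Patriot"),
  c("Titanic"),
  c("Titanic")
)

# Fraction of carts that contain every item in `items`.
support <- function(items) {
  mean(sapply(carts, function(cart) all(items %in% cart)))
}

lhs <- c("Gladiator", "Patriot")   # antecedent
rhs <- "Sixth Sense"               # consequent

supp <- support(c(lhs, rhs))       # carts with all three movies
conf <- supp / support(lhs)        # of carts with lhs, share that also have rhs
lift <- conf / support(rhs)        # confidence relative to the baseline rate of rhs

c(support = supp, confidence = conf, lift = lift)
```
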



Health Insurance Commission 
6.8 million records x 120 attributes (3.5 GB) 
12 months preprocessing then 2 weeks data mining

Goal : find associations between tests

cmin = min confidence
smin = min support


Hands On







5. DMT: Predictive Data Mining: Decision Trees (Prediction)


Often referred to as supervised learning (we already have a decision)

Like deciding if we should lend money to a person
-> We will have a model that can be used to arrive at the decision. The model would have been built by
How do we find a good model?
There can be an infinite number of models
1. Write down the infinite number of models (this would take infinite time to search) [2]
2. Measure each model and find the best one

[2] We use heuristic search to see how good a model is

In the room example: Whether a person would be wearing glasses?

  • 30% of females are wearing glasses
  • 60% of males are wearing glasses
  • (60% is not accurate enough) so we further divide by age:
  • People above age 42 have an 80% chance of wearing glasses

If this is not effective enough, the algorithm starts taking other parameters and tries to get better models

But how do I choose the best variables to reduce search time and get the best model?

Formula for entropy (disorder): -sum(p * log2(p)), summed over the proportion p of each class

Induction Tree - Greedy Algorithm (Heuristics - Goodness) # Important

  • Partition by every variable: gender, age, height, shirt colour, shoe colour
  • Check which variable maximises the reduction in entropy
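
The "reduction in entropy" for a single candidate split can be worked through with the glasses example. This is a sketch only: the 50/50 mix of females and males is my assumption (the notes give only the percentages within each group), and entropy() is a helper defined here.

```r
# Entropy (in bits) of a vector of class proportions.
entropy <- function(p) {
  p <- p[p > 0]              # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Proportions from the room example; the 50/50 gender mix is an assumption.
w_female <- 0.5
w_male   <- 0.5
p_female <- 0.30             # females wearing glasses
p_male   <- 0.60             # males wearing glasses

p_all  <- w_female * p_female + w_male * p_male
before <- entropy(c(p_all, 1 - p_all))
after  <- w_female * entropy(c(p_female, 1 - p_female)) +
          w_male   * entropy(c(p_male,   1 - p_male))

gain <- before - after       # reduction in entropy from splitting on gender
gain
```

The greedy step repeats this calculation for every variable and splits on the one with the largest gain.
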

Hands On Rattle


rpart - recursive partitioning 
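
Outside Rattle, a tree can be grown by calling rpart() directly. A minimal sketch, using the kyphosis example data bundled with the rpart package as a stand-in for the weather data. The loss matrix values in the second fit are my own illustrative choice.

```r
library(rpart)   # recommended package that ships with standard R installations

# kyphosis is a small example dataset bundled with rpart (stand-in data).
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis, method = "class")
print(fit)       # the splits chosen at each step

# Predicted class for each observation
pred <- predict(fit, kyphosis, type = "class")
head(pred)

# Optional: a loss matrix makes one kind of mistake costlier than the other.
# Rows are the actual class, columns the predicted class; here missing a
# "present" case is assumed to cost 10 times as much as a false alarm.
loss <- matrix(c( 0, 1,
                 10, 0), nrow = 2, byrow = TRUE)
fit_loss <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(loss = loss))
```
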

The four possible outcomes (a false positive is a type 1 error, a false negative is a type 2 error):
true positive - it will rain and it rains
false positive - it will rain and it doesn't rain
true negative - it won't rain and it doesn't rain
false negative - it won't rain and it rains ( i don't want this, i'll get wet)
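
Counting the four outcomes is a cross-tabulation of what was forecast against what happened. A small sketch with made-up forecasts for ten days:

```r
# Hypothetical forecasts and outcomes for ten days (made up for illustration).
actual    <- factor(c("Yes", "No", "No", "Yes", "No", "No", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "No", "No", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

# Rows are what happened, columns are what was forecast.
em <- table(Actual = actual, Predicted = predicted)
em

tp <- em["Yes", "Yes"]   # forecast rain, it rained
fp <- em["No",  "Yes"]   # forecast rain, stayed dry
tn <- em["No",  "No"]    # forecast dry, stayed dry
fn <- em["Yes", "No"]    # forecast dry, it rained (you get wet)

fn / (fn + tp)           # share of rainy days the forecast missed
```
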

Chance of No is 84%

Very widely used - has been there for a long time. 

Democracy doesn't always give us the answer. If it did, the world would still be flat.



6. DMT - Evaluation Tab - Error Matrix in Rattle (To evaluate, improve goodness)


Always look at the Log tab to find the code of good models

Type:
Error Matrix
This is how we can see the rate of false positives etc.

This is how you can avoid getting wet

Evaluation of params for the above: Loss matrix 0,10,1,0


7. DMT - Evaluation Tab - ROC in Rattle (To evaluate, improve goodness)


Used in the Second World War

The orange curve is the ideal one. Red is what we have now.







16% of the days it rains, 84% of the days it won't rain
or
16% of people need to pay tax, auditing the other 84% is a waste
Y-axis: true positives (payers found)
X-axis: the number of people that I'm going to audit (auditing non-payers is not productive)

The black line says: what happens if I audit randomly?
I'll get 16 out of 16 payers (y-axis at 100%) only when I audit all 100 people.
I'll get 8 out of 16 payers when I audit 50 people. This is the baseline (obvious path, but not effective)

But the green line says:
I'll rank people and order them from 1-100 on the x-axis. I'll get 10 out of 16 people when I reach 20 people.
I'll find all 16 by the time I reach 50 people
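
The ranked-versus-random comparison can be simulated in a few lines. Everything here is hypothetical: the population, the model score and its noise level are invented, so the counts will not match the chart from the session.

```r
set.seed(1)
n <- 100
# Hypothetical population: 16 of 100 people actually owe tax.
owes  <- sample(rep(c(1, 0), times = c(16, 84)))
# Hypothetical model score: higher for people who owe, plus noise.
score <- owes + rnorm(n, sd = 0.7)

# Audit in order of decreasing score and count the payers found so far.
ranked <- owes[order(score, decreasing = TRUE)]
found  <- cumsum(ranked)

audited  <- 1:n
baseline <- 16 * audited / n      # expected finds when auditing at random

found[c(20, 50, 100)]             # payers found after 20, 50 and 100 audits
baseline[c(20, 50, 100)]          # what random auditing would be expected to find
```

The better the score separates payers from non-payers, the faster the found curve climbs above the baseline.
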

Assignment, try these:

1. Model  > Linear
and also
2. Evaluate > Linear

1. Model > All
2. Evaluate > All 

