Monday, September 15, 2014

Cluster Analysis using R with banking customer balance distribution

A wise man (Yachine Phuneli), while teaching me about Java, semiconductors and wafer fabrication, also taught me how to teach and explain:

First tell your audience/readers what you are going to tell them.
Then tell them the thing you planned to tell.
And then tell them what you just told.

I am going to describe how to do cluster analysis using R.

Cluster Analysis
Cluster analysis has a vital role in numerous fields. Here we are going to see it applied in the banking business, to segment customers into small groups that can later be targeted for future marketing activities.
In machine learning and data mining it is also used to find nearest neighbours efficiently and for summarization.

Cluster analysis aims to group data objects based on the available information that describes the objects and their relationships. The main goal is to group similar objects together: the greater the similarity within a group and the greater the difference between groups, the better and more distinct the clustering. A clustering is an entire collection of clusters; a cluster, on the other hand, is just one part of the entire picture. There are different types of clusters and also different types of clustering.

Types of Clustering Algorithms

1.     Partitioning-based clustering: algorithms that, in most cases, determine all the clusters at once.
o    K-means clustering
o    K-medoids clustering
o    EM (expectation maximization) clustering
2.     Hierarchical clustering: these algorithms find successive clusters using previously established ones.
o    Divisive clustering is a top down approach.
o    Agglomerative clustering is a bottom up approach.
With the help of data mining methods such as clustering algorithms, it is possible to discover the key characteristics in the bank’s data and possibly use those characteristics for future prediction as well.
According to information released by banks, attracting a new customer costs five to six times more than retaining an existing one. Retaining existing customers is therefore the core marketing strategy for staying profitable in the very competitive banking industry. To maximise profit, how to retain existing customers has become a problem that banks urgently need to solve.

Use clustering to produce an initial working hypothesis, refine this hypothesis, then use prediction to generalize the refined hypothesis to data and evaluate how well it performs.




For this study, the following bank customer data was used for the balance distribution.


Before doing the cluster analysis, the above data was transformed based on the type or range of the values. This helps us to do the scaling and find the number of clusters quickly. We could keep the data as it is, but I opted to cluster on the simplified data.
Balance conversion based on range:
=IF(AND(Sheet2!I2>=0,Sheet2!I2<=1500),"1",IF(AND(Sheet2!I2>1500,Sheet2!I2<=3000),"2",IF(AND(Sheet2!I2>3000,Sheet2!I2<=6000),"3","4")))  -> 1 if balance is 0 to 1500, 2 if 1501 to 3000, 3 if 3001 to 6000, 4 if greater than 6000
Age conversion based on the range:
=IF(AND(Sheet2!E2>=20,Sheet2!E2<=35),"1",IF(AND(Sheet2!E2>35,Sheet2!E2<=45),"2",IF(AND(Sheet2!E2>45,Sheet2!E2<=60),"3","4")))
Customer type conversion based on type:
=IF(Sheet2!C6="Enterprise","1","2")
Product conversion:
=IF(Sheet2!C6="SavingAccount","1","2")  -> 1 if SavingAccount, 2 if CurrentAccount
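The same range-to-code conversion can be pictured outside the spreadsheet. The following is only a minimal Python sketch of the rules described above; the function names are my own (hypothetical), and values outside the listed ranges fall through to the last code, as they would in nested spreadsheet IFs.

```python
def balance_band(balance):
    """Code a balance as 1-4 using the ranges described in the post."""
    if 0 <= balance <= 1500:
        return 1
    if 1500 < balance <= 3000:
        return 2
    if 3000 < balance <= 6000:
        return 3
    return 4  # everything else, including balances above 6000


def age_band(age):
    """Code an age as 1-4: 20-35, 36-45, 46-60, anything else."""
    if 20 <= age <= 35:
        return 1
    if 35 < age <= 45:
        return 2
    if 45 < age <= 60:
        return 3
    return 4


def customer_type_code(customer_type):
    """1 for Enterprise, 2 for every other customer type."""
    return 1 if customer_type == "Enterprise" else 2


def product_code(product):
    """1 for SavingAccount, 2 otherwise (CurrentAccount)."""
    return 1 if product == "SavingAccount" else 2
```

For example, balance_band(2500) gives 2 and age_band(50) gives 3.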

So we got the data transformed in the following way:






    Find the Number of Clusters

1) Load csv data from file:



 2)  Once the data is loaded, perform the scaling of the data by executing
> bDataScale <- scale(bankcustomerdata)


 3)     Load the NbClust library and find the number of clusters for the scaled data (bDataScale):
> library(NbClust)
> nc <- NbClust(bDataScale, min.nc=2, max.nc=15, method="kmeans")
The above command takes a while to run.

Following is the output of the NbClust function:

The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a significant increase of the value of the measure i.e the significant peak in Hubert index second differences plot.

The D index is a graphical method of determining the number of clusters. In the plot of D index, we seek a significant knee (the significant peak in Dindex second differences plot) that corresponds to a significant increase of the value of the measure.

All 1000 observations were used.

*******************************************************************
* Among all indices:                                               
* 4 proposed 2 as the best number of clusters
* 9 proposed 3 as the best number of clusters
* 1 proposed 7 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 2 proposed 11 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 2 proposed 15 as the best number of clusters

                   ***** Conclusion *****                           
 * According to the majority rule, the best number of clusters is  3 



4)      Plot the chart with the number of clusters we have obtained.
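The majority rule in the NbClust output above is just a vote count over the indices. As an illustration only (in Python rather than R, with the vote counts copied from the output above), it can be sketched like this:

```python
# Votes copied from the NbClust summary: number of clusters -> number of indices proposing it
votes = {2: 4, 3: 9, 7: 1, 8: 1, 9: 2, 11: 2, 12: 2, 15: 2}


def majority_k(votes):
    """Return the number of clusters proposed by the most indices.

    On a tie this returns the first entry encountered, which is a choice
    made for this sketch and not necessarily what NbClust itself does.
    """
    return max(votes, key=votes.get)


best_k = majority_k(votes)
print(best_k)
```

With these counts the winner is 3, which matches the conclusion printed by NbClust.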




     Calculate K-means clusters
   Now let us find the clusters based on quarterly average balance. The bdata loaded from the .csv file is:






Here I want to consider only AvgBalQ1, AvgBalQ2, AvgBalQ3 and AvgBalQ4 for the k-means clustering.
> test <- bdata[, c("AvgBalQ1", "AvgBalQ2", "AvgBalQ3", "AvgBalQ4")]
Scale the data:
> scaledata <- scale(test)
Set the seed so that the k-means result is the same every time we calculate it:
> set.seed(1234)
Run k-means with the number of clusters/centers set to 3:
> km <- kmeans(scaledata, centers=3, nstart=10, iter.max=500)
Check the cluster sizes: 401 observations in cluster 1, 310 in cluster 2 and 289 in cluster 3.
> km$size
[1] 401 310 289
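To make the scaling step less of a black box: with its default arguments, R's scale() subtracts each column's mean and divides by the column's sample standard deviation. Here is a minimal Python sketch of that idea; the function name scale_columns and the toy numbers are mine, not part of the study data.

```python
from statistics import mean, stdev


def scale_columns(rows):
    """Standardize each column: subtract its mean, divide by its sample sd.

    Assumes every column has at least two values and is not constant
    (a constant column would cause a division by zero).
    """
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [stdev(col) for col in columns]  # sample sd (n - 1 denominator)
    return [
        [(value - m) / s for value, m, s in zip(row, centers, spreads)]
        for row in rows
    ]


# Toy example: two balance columns on very different scales
toy = [[1000, 20000], [2000, 40000], [3000, 60000]]
scaled = scale_columns(toy)
```

After scaling, both toy columns become -1, 0, 1, so neither column dominates the distance calculation in k-means just because its numbers are larger.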


Let’s have a look at the kmeans function:

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

Input to kmeans function
x: A numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).
centers: Either the number of clusters or a set of initial (distinct) cluster centers. If a number, a random set of (distinct) rows in x is chosen as the initial centers.
iter.max: The maximum number of iterations allowed.
nstart: If centers is a number, nstart gives the number of random sets that should be chosen.
algorithm: The algorithm to be used. It should be one of "Hartigan-Wong", "Lloyd", "Forgy" or "MacQueen". If no algorithm is specified, the algorithm of Hartigan and Wong is used by default.

Result returned from kmeans function call
cluster: A vector of integers indicating the cluster to which each point is allocated.
centers: A matrix of cluster centers.
withinss: The within-cluster sum of squares for each cluster.
size: The number of points in each cluster.
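To show what happens behind these inputs and outputs, here is a small, self-contained Python sketch of the plain Lloyd iteration (assign each point to the nearest center, recompute the centers, repeat). It is only an illustration: R's kmeans uses the Hartigan-Wong algorithm by default, nstart restarts are left out, the function name kmeans_lloyd is my own, and the cluster labels are 0-based here while R's are 1-based.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_lloyd(points, k, iter_max=10, seed=1234):
    """Very small Lloyd-style k-means; expects iter_max >= 1."""
    rng = random.Random(seed)
    # Pick k distinct rows as the initial centers
    centers = [list(p) for p in rng.sample(points, k)]
    cluster = []
    for _ in range(iter_max):
        # Assignment step: nearest center for every point
        new_cluster = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_cluster == cluster:
            break  # assignments stopped changing
        cluster = new_cluster
        # Update step: each center becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, cluster) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    withinss = [
        sum(sq_dist(p, centers[c]) for p, a in zip(points, cluster) if a == c)
        for c in range(k)
    ]
    size = [cluster.count(c) for c in range(k)]
    return {"cluster": cluster, "centers": centers,
            "withinss": withinss, "size": size}


# Toy data: two obvious groups
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_result = kmeans_lloyd(toy_points, k=2)
```

The returned dictionary mirrors the cluster, centers, withinss and size components listed above.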

(Fig: Quarter wise Product data distribution)
Now suppose the bank wants to act on this data by grouping its customer base and starting a product or campaign for each group. For example:
a)     Looking at the above data distribution, the bank may want to suggest that high net worth/higher account balance customers with a savings account opt for another product, or
b)     The bank may want to offer those customers, or a group of customers (cluster), additional services without charges.
Partitioning the data into clusters/groups helps in taking the necessary actions for each cluster: it is easier to group customers into clusters and then plan business activities for the respective clusters.
K-means clustering is the most popular partitioning method. It requires the analyst to specify the number of clusters to extract.
So let us group the data into 3 clusters (see the section on finding the number of clusters for how we get 3).


(Fig: Quarterly Average Balance with 3 cluster data distribution)
If we look at these four graphs, one per quarter, Cluster 3 (in blue) has some customers whose balance is greater than that of customers in the other clusters. We can map these clusters back to the original data and find those specific customers in the cluster to take appropriate action.
Kmeans clustering with 3 clusters of sizes 401, 310, 289
Cluster means:
           AvgBalQ1 AvgBalQ2 AvgBalQ3 AvgBalQ4
Cluster 1     4019     2189     5059     5497
Cluster 2     2808     6413     3630     4044
Cluster 3     5611     6527     7222     6638
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 42.22%,  Cluster 2: 26.07%  Cluster 3: 31.72%

Note:
For every measure (non-factor, numeric parameter) we choose, these graphs will keep changing, because the cluster means change and so does their variance.
For example, if we cluster into 3 clusters but also include age in the data, then the cluster sizes, means and percentages change.
Kmeans clustering with 3 clusters of sizes 353, 366, 281
Cluster means:
             Age  AvgBalQ1  AvgBalQ2  AvgBalQ3  AvgBalQ4
Cluster 1   31.5      3658      3858      5972      4869
Cluster 2   59.8      3366      4640      2972      4882
Cluster 3   57.1      5624      6023      7279      6659
Percentage of within cluster variance accounted for by each cluster:
Cluster 1: 34.21%, Cluster 2: 34.84%, Cluster 3: 30.95%


Now let’s do some more analysis with the distribution of the data.

Data Interpretation: Understand the data based on the different combination of factors/dimensions and measures.
Factors/Dimensions: A dimension is a broad grouping of related data about a major aspect of your business. For example, you have a dimension called Products.
Measure: A measure is a performance indicator that is quantifiable and used to determine how well a business is operating. For example, useful measures may be Average yearly balance.
In the following figures, I have worked out only two interpretations (there could be more) and two possible actions that can be taken.
Note: I am not sure which city belongs to which state in the US. I generated the cities and states using Talend (see the previous post for how to generate data using Talend) for the purpose of this study, as these cities and states were available in RowGenerator for generating records, and I moved some of them around to make the data look better.


(Fig: Product wise Gender data distribution)

If we analyse the above graph, it suggests that there are a few female customers of the bank who have a current account as their product. This is interesting, and a bank can interpret it in two ways:
I.        Interpretation:
a.     Either these customers are unaware of which product they should have and are paying unnecessary charges for a CA rather than going for an SA.
b.     Or they are entrepreneurs or run a small business.
II.        Actions:
a.     A bank executive can notify these customers about a possible change of product from CA to SA, winning their loyalty by showing that the bank cares about its customers’ money.
b.     The marketing team can target these customers according to their specific needs, or keep them in mind for a future product that would suit female customers who have a CurrentAccount and are entrepreneurs.



(Fig: Age wise Gender data distribution)
I.        Data Interpretation:
a.     There are more male customers than female or enterprise customers.
b.     Enterprise customers mainly lie in the age range 38-67.
II.        Possible Actions:
a.     The products for enterprise and female customers may not be very effective, so a few new products more suitable to them could be launched with a campaign, etc.
b.     Start-ups (enterprise) do not have enough accounts with the bank, as most of the customers are in the 40-70 range and only a few are in the 20-40 range. If an organization/start-up opens an account with the bank, there is a chance of getting more salary accounts from the same enterprise customer.

(Fig: City wise Gender data distribution)
I.        Data Interpretation
a.     Albany city has more enterprise customers than all other cities.
b.     Atlanta city has more personal customers (male + female).
II.        Possible Actions:
a.     Any campaigns which are targeted for personal customers should include cities like Atlanta.
b.     Any campaigns which are targeted for enterprise customer should include cities like Albany.
(Fig: City wise Gender and State wise Gender data distribution)


(Fig: Per Quarter Average Balance per Gender)
I.        Data Interpretation
a.     The Q3 and Q4 average balances increase in comparison to Q1 and Q2. Look at the movement of the boxplots in all quarters, not only the dots.
b.     In Q1 and Q2, enterprise customers lean more towards lower balances than in Q3 and Q4.
II.        Possible Actions:
a.     If this is the regular trend every year, it shows that cash flow for these customers increases in the last two quarters, so that would be a more appropriate time for campaigns. The bank would be able to get a better return on its investment in those campaigns.
ROA = Margin * Asset Velocity
Asset velocity = Sales / Assets
More sales with more competitive product.
More sales mean more asset velocity.
More asset velocity means more return on assets.
More ROA is more profit.

b.     The bank can offer a few additional facilities to customers who are not using the overdraft or limit facility. Or, based on the economic environment in the country (linking external data with the bank dataset), if organisations are looking for funds for capital expenditure, the bank can offer products accordingly.
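The ROA relation used under action (a) can be checked with a few numbers. This is only a sketch with made-up figures, and it assumes "margin" means profit divided by sales, which the text above does not spell out.

```python
def return_on_assets(profit, sales, assets):
    """ROA = margin * asset velocity.

    Assumption for this sketch: margin = profit / sales.
    Asset velocity = sales / assets, as stated in the post.
    """
    margin = profit / sales
    asset_velocity = sales / assets
    return margin * asset_velocity


# Made-up figures: same margin (5%) and same assets, but sales double
roa_before = return_on_assets(profit=10, sales=200, assets=1000)
roa_after = return_on_assets(profit=20, sales=400, assets=1000)
```

With the margin held at 5%, doubling sales doubles the asset velocity (0.2 to 0.4) and therefore doubles ROA (1% to 2%), which is the chain of reasoning listed above.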

  (Fig1: Gender wise Product data distribution)                           (Fig2: Yearly Average Balance per Product)
I.        Data Interpretation
a.     There are more female current account customers than male. (Fig 1)
b.     There are a few customers who have a savings account but whose average balance is higher than that of the rest of the population. (Fig 2)
II.        Possible Actions:
a.     Create a new product, launch a campaign or correct current products to increase the male customer base with current accounts.
b.     These customers can be advised to move their funds from savings accounts to fixed deposits; this would help the bank win their loyalty by showing that it cares about its customers’ money.


(Fig: Quarter wise State data distribution)
I.        Data Interpretation
a.     Alaska and Louisiana have the largest customer base. Florida has the smallest.
II.        Possible Actions:
a.     Specific marketing campaigns should be planned to build the customer base in states where the number of customers is too low.
b.     Find out whether banking operations (operational efficiency) or marketing is the reason for the lower number of customers:
                                          i.    Check the trend in the number of accounts/customers every year and see whether the customer base is diminishing year on year. Find the exact problem and take corrective action. One example cause of a shrinking customer base is operational efficiency, so check the operational efficiency of every branch in Florida and the other states with a smaller customer base.
                                         ii.    There may not have been enough marketing or campaigns, as these were not areas of focus in previous years.

Cluster analysis is one of the important techniques for analysing your data, and it is not easy, especially if you are not a statistician/mathematician. R is a wonderful tool for doing all this analytics work with ease. I have used R, RStudio and Shiny (a web-based framework for R, for more practical visualization and statistical analysis).


Thanks to my friend and colleague "Alex" at Misys for introducing me to this wonderful tool and technology, R.


Sunday, August 17, 2014

Talend Big Data Integration with Hadoop


Hadoop can be downloaded from the Apache Hadoop website at hadoop.apache.org. This includes core modules like Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Additional Hadoop-related projects like Hive, Pig, HBase and many more can be downloaded from their respective Apache websites. But setting all of this up is not easy.

Once this is done, doing the data integration is another challenge.
Writing everything for the data integration directly in Pig, Hive or other scripts is a rather difficult route.

Instead of worrying and doing too much, I think the easier way is to use the sandbox provided by any of the Hadoop vendors together with the data integration platform by Talend. The good thing is that both the Hadoop installation and the Talend big data integration studio software are free and available under the Apache license :-)

Some of the hadoop vendors are HortonWorks, Cloudera, MapR, etc.


When I first tried HortonWorks with the Talend big data integration platform, it made things very easy.
Of course, for enterprise or cloud use there is a license. But as I mentioned earlier, these are available under the Apache license and are free.

So I thought of describing how this can be done. In later posts I can explain how Talend and HortonWorks can be used for their different features, so this post is an overview and later posts can go into detail.

So what do we need?
1) A virtual machine where Hadoop can be installed.
2) A Hadoop appliance to be installed on the virtual machine, i.e. the HortonWorks sandbox.
3) Talend Open Studio for Big Data
4) Create a job for data load on HDFS.

1) Virtual Machine:
  I have used Oracle VirtualBox from https://www.virtualbox.org. There was an issue with the latest 4.3 version, so I used 4.3.12 from this link: https://www.virtualbox.org/wiki/Download_Old_Builds_4_3
  Once you have finished installing VirtualBox, proceed to the next step.

2) Hadoop appliance to be installed on the virtual machine, i.e. HortonWorks sandbox:
  a) Download:
  There are multiple sandboxes available from HortonWorks, for VirtualBox, VMware and Hyper-V. As we have
   installed VirtualBox, download that one. It can be downloaded from the following link:
   http://hortonworks.com/products/hortonworks-sandbox/#install
 
   b) Setup:
   Start VirtualBox, go to the File menu and click on Import Appliance.
 
Fig: Importing of HortonWorks hadoop appliance-1.

      Select your appliance(.ova) file and click on next. You would see the import virtual appliance wizard,
      select the appropriate memory and other parameters based on your need. You can set the network

Fig: Importing of HortonWorks hadoop appliance-2.

       Change the network settings so that the VM gets its own IP. By default the adapter is NAT; change it to Bridged Adapter.
Fig: Setting of the Network Adapter.
  c) Verify  

  •       You can verify your installation and find the IP of the VM running Hadoop.
  •       Click on the HortonWorks sandbox and click on the Start button. It will take some time to boot and start all the services (Hadoop, Pig, Hive, etc.) and will then prompt for authentication.
  •       Username is "root" and password is "hadoop".
  •       At the command prompt type ifconfig (as it is Linux) and note down the IP, which will be used in Talend studio when creating the HDFSConnection.
  •       You can use the IP with port 8000 (default) to check that the installation succeeded and the services are running.

         
You can browse the various services at : http://xxx.yyy.xxx.zzz:8000/
Fig: Accessing hadoop services using the browser from remote machine.



You can execute some script or browse the older scripts from the pig menu.
                                                  Fig: Accessing and running pig script from remote machine.



3) Talend open studio:
   You can download Talend Open Studio for Big Data (not the Data Integration edition) from the following link:
http://www.talend.com/download?qt-download_landing=0#quicktabs-download_landing

    If you want, you can download the Talend Big Data Sandbox for any of the Hadoop providers, but I wanted to keep the two on separate machines, which would be the ideal case.
    It is an .exe file if you are installing on Windows; execute it, select the installation location, and you are done with the Talend Open Studio installation.

 4) Create a job for data load on HDFS:
     What we want to do here is create a job which generates a few records and writes them to HDFS on the Hadoop we installed in step 2. Creating and running the job happens on my local box, while Hadoop is running on a remote machine.

      At the end it will look like the following, and executing the job will create a file of 100 records in HDFS.
   
     a) Start talend by starting TOS_BD-win-x86_64.exe
     b) Create a new project and open it.
     c) Design a job: Following steps would make job designing complete.
     
 i) Create a new job by right-clicking on Repository->Job Design
                                         
Fig: Create a new job
ii) Create a new HDFSConnection to connect to HDFS. As this is going to connect to the sandbox we installed earlier, provide the IP and port (8020) for the NameNode URI.
Fig: Create a HDFS Connection and set the URI to the sandbox IP and Port (8020)

  iii) Create a tRowGenerator object, which generates the required rows based on the schema created for it. For this example I have generated 100 rows. Other information about the schema is available in the images below.

Fig: tRowGenerator which would generate 100 rows data.


     Fig: Schema for the data generation

     iv) Link tHDFSConnection to tRowGenerator: right-click on tHDFSConnection_1, click on Trigger -> On Subjob Ok and connect it to tRowGenerator_1. This starts the generator once the connection is successful.
  v) Write data to HDFS using tHDFSOutput: right-click on tRowGenerator_1 and drag to tHDFSOutput_1. If you need to see the generated output, you can add tLogRow_1 and connect it in a similar way.

Fig: tHDFSOutput object to write to 

    Now our design is complete, save it.

     d) Run/execute the job to create the file in HDFS. You can either press F6 or go to the Run tab and click on the Run button. On successful execution you will see mygeneratedout.csv in the location provided in tHDFSOutput_1, which is "/user/hue/mygeneratedout.csv".

   You can click on the file to see the content.
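If you want a feel for what the generator half of this job produces without opening Talend, the following Python sketch builds 100 rows and renders them as CSV text. It is only an illustration: the column names (id, name, city) and the city values are hypothetical, since the real schema is defined in the tRowGenerator screens, and the sketch keeps the result in memory instead of writing to HDFS.

```python
import csv
import io
import random

# Hypothetical schema for illustration; the real one lives in tRowGenerator
FIELDS = ["id", "name", "city"]
CITIES = ["Albany", "Atlanta"]


def generate_rows(n=100, seed=0):
    """Generate n rows of sample data, loosely like a row generator."""
    rng = random.Random(seed)
    return [
        {"id": i, "name": f"customer_{i}", "city": rng.choice(CITIES)}
        for i in range(1, n + 1)
    ]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = generate_rows()
csv_text = to_csv(rows)
```

The resulting text has one header line plus 100 data lines, comparable in shape to the generated file that the job writes to HDFS.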

Thanks for reading the blog. :-)

Soon I will write a few more blogs on Hadoop and Talend explaining their specific features in more detail.




Sunday, August 10, 2014

RaspberryPi to run DC Motor using L298N Motor Controller

Raspberry Pi is awesome and interesting. I was totally astonished when I learned about it from one of my friends and colleagues, "Alex" at Misys, who keeps sharing many interesting things of this kind.

I studied computer applications and always had it in mind that I had missed out on engineering/robotics. As I have no knowledge of electronics and mechanics, it really looked difficult for me to try something in this area. One of the main reasons is the complexity involved in robotics in general (at least it was complex in my mind until I tried it).
The Raspberry Pi has made it simple and abstract enough for users like me to achieve the same.

They say it is for kids to enjoy... I am still wondering, is it really only for kids? Anyway, I am enjoying this.

You can find out more about the Raspberry Pi on their website. The most amazing thing for me is that this small credit-card sized computer can be used to control devices. In this post I am not showing anything different from what others have done; in fact I have referred to some of them.

I ordered the following from eBay to get the motor running:

RaspberryPI    - 1
DCMotor        - 1
L298N Motor Controller  -  1
PowerSupply (2*2)  -  1
Male-Female and Female-Female connectors.  - Few

RaspberryPI

DC Motor

L298N Motor Controller

Following video should give you some idea about  RPi and what it can do.




You can find the python program which was used to control the motor here:



Connectivity for the pins is explained below:
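As a rough sketch of what such a program can look like, here is a minimal Python example. It assumes that ENA, IN1 and IN2 of the L298N are wired to BCM pins 25, 23 and 24; these pin numbers are hypothetical choices for illustration and not necessarily the wiring used in this post. The GPIO module is passed in as a parameter so that the logic can also be tried on a machine without a Raspberry Pi.

```python
import time

# Hypothetical BCM pin numbers; adjust to your own wiring
ENA, IN1, IN2 = 25, 23, 24


def l298n_levels(direction):
    """Return the (IN1, IN2) logic levels for one L298N channel."""
    table = {
        "forward": (1, 0),  # IN1 high, IN2 low
        "reverse": (0, 1),  # IN1 low, IN2 high
        "stop": (0, 0),     # both low
    }
    return table[direction]


def run_motor(gpio, direction, seconds):
    """Drive the motor in one direction for a while, then switch it off.

    `gpio` is expected to behave like the RPi.GPIO module.
    """
    gpio.setmode(gpio.BCM)
    for pin in (ENA, IN1, IN2):
        gpio.setup(pin, gpio.OUT)
    try:
        in1, in2 = l298n_levels(direction)
        gpio.output(IN1, in1)
        gpio.output(IN2, in2)
        gpio.output(ENA, gpio.HIGH)  # enable the channel
        time.sleep(seconds)
    finally:
        gpio.output(ENA, gpio.LOW)   # disable the channel
        gpio.cleanup()


# On a Raspberry Pi this could be used as:
#   import RPi.GPIO as GPIO
#   run_motor(GPIO, "forward", 2)
```

Swapping "forward" for "reverse" flips the IN1/IN2 levels, which reverses the polarity the L298N applies to the motor.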


I am excited thinking about the various opportunities for applying the RPi.
I have already received the Raspberry Pi camera and am planning to make my own security system. :-)

This is the most amazing thing I have come across in recent times.


I faced a specific problem with HDMI connectivity to my monitor. My first configuration was with a relatively new monitor (Dell) and it was fine. But when I switched to another monitor (LG), it did not work; there was no display. I struggled for a few days but finally found a way and posted the solution here:

http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=5107&p=595514

Wednesday, August 28, 2013

Runtime instrumentation of bytecode using javaagent with Javassist

I was looking into some classloading issues in WebSphere Application Server and came across the javaagent argument. I found it really interesting when I read "The agent class will be loaded by the same classloader which loads the class containing the application main method.", and I started digging into it.

This was a new thing for me. I used a transformer to see which location a class was loaded from. Doing the same from the debugger for multiple classes was really time-consuming. With a transformer I could easily find out which classloader loaded a class and from which location, using ProtectionDomain, etc. Coincidentally I was also working on profiling and found it can be used there as well.

I wrote a small program around it and found it useful. Many people have already written in this area, but I could not quickly find a blog on how to get a count of method executions, so I thought of writing it myself. Of course many tools like Jensor, JRat, etc. offer a lot more, but I did it just to get the method count and not any other reporting.

So ultimately this blog post covers

  • Javaagent and premain
  • ClassFileTransformer
  • Javassist



  • Javaagent and premain: Introduced in JDK 5 as a VM argument. It can augment Java bytecode dynamically with transformers, which helps in profiling, bytecode manipulation, etc. With this dynamic manipulation we can change bytecode at runtime, which is one of the most useful features of Java. The agent can be added as a JVM argument for your program or server. The agent must implement a premain method.
         The premain method is the one which gets executed before the class is loaded and before the main
         method of the class is executed. A javaagent can be added as a JVM argument as follows:

        -javaagent:<jarpath>, e.g. -javaagent:D:\Instrumentation\dsinstrumentation.jar





  • ClassFileTransformer: An agent can add one or more transformers; for each we have to implement the ClassFileTransformer interface. The transformer class is responsible for the transformation of the bytecode. The transformation occurs before the class is defined by the JVM.

          The jar which we give as the jarpath for the javaagent argument must have a Premain-Class attribute in its
          MANIFEST.MF file, which is as follows:




  • Javassist: I have used Javassist for bytecode instrumentation. There are multiple choices to achieve the same: Javassist, ASM, BCEL from Apache, etc.



You can find/download the source code from my GitHub repository, which also includes dsinstrumentation.jar so that it can be used directly.
The user needs to provide
a) -javaagent
b) a configuration system property pointing to the conf folder, which should contain the classinstrumentation.properties file
with the class file names in the following format:



Download : git repository for instrumentation

To run dsinstrumentation.jar (available at the download location mentioned above) as a javaagent, give the following JVM arguments:

-javaagent:<FOLDER_PATH_TO_INSTRUMENTATION_JAR>\dsinstrumentation.jar
-Dconfiguration=<FOLDER_PATH_TO_CONF_FOLDER>\conf

Friday, January 25, 2013

Realization of dream of multitenancy with Google App Engine and MongoDB on cloud.

Multitenancy and the problem of managing the data

Multitenancy refers to a principle in software architecture where a single instance of the software runs on a server, serving multiple client organizations (tenants). Multitenancy is contrasted with a multi-instance architecture where separate software instances (or hardware systems) are set up for different client organizations. With a multitenant architecture, a software application is designed to virtually partition its data and configuration, and each client organization works with a customized virtual application instance. [WIKI]

What if I want to migrate my existing system to a cloud engine?
Moving all of the code/business logic to an app engine is fine, no problem, but what if my database is already on another physical infrastructure (and is sharded, etc.)?


Of course App Engine/GAE gives you scalability, security, high availability, etc., but on its own terms. You need to use a database that is supported by the app engine.
This is one of the major drawbacks of GAE and, similarly, of other cloud engines.

I wanted to use MongoDB with Google App Engine (GAE) and could not see how, because we can't run MongoDB on GAE and are supposed to use the Google datastore.

There are a few more issues due to GAE's highly restricted sandbox. As GAE's docs say, your application can only access other computers on the internet through the provided URL fetch and email services (and a few more ways). Other computers can only connect to the application by making HTTP/HTTPS requests on the standard ports. If you want to open a socket from your business logic, it is not allowed and GAE will raise an exception.

A few good things happened in this direction when Google announced, back in 2011, the Google Cloud SQL web service to support MySQL - https://developers.google.com/cloud-sql/
Some other cloud engines have provided more options; for example, Amazon RDS supports the MySQL, Oracle and Microsoft SQL Server database engines. This means that the code, applications, and tools you already use today with your existing databases can be used with Amazon RDS.

I was adamant about not moving away from MongoDB while running my application on GAE, and mongolab became the savior. Why do I want to use my own choice of database? Because I already have my own hosted database servers, and because I want to manage the data on my own (security reasons, business reasons, etc.).

Here are the steps to use GAE with MongoDB when your MongoDB instance is hosted on a database cloud.

1) Signup, login and create a database on mongolab


This is going to create a database named "mongodb".


2) Create a collection
    A collection (table) can be created by going to the database and clicking on "Add".

3) Open the collection, which is currently empty, and open the API view.
    The connection information at the top of the "Collections" tab is the URL which we are going to use to interact
    with the "mongodb" database created in step 1; mongolab provides REST-based methods for database
    interactions.



    At this point we are done with our database part.

    Let's move to GWT (Google Web Toolkit) to create a project.

4) Create a Web Application Project and make sure that "Use Google App Engine" is checked.


    Create a GWT-based application. I have used almost the same application which I created earlier;
    you can refer to the steps for creating the application here: GWT RPC - Server communication with 
    MongoDB

    The only change here is that MongoDB is hosted on a cloud, and because of that the implementation
    is going to change.

5) Insert a record and retrieve records from the cloud database.
    A record is inserted into the database using Apache's HttpClient to connect to the MongoDB hosted on the
    cloud, through the REST-based web service.



    Get all available users from the database.


   As I mentioned earlier regarding the sandbox security of GAE, I could not use HttpClient directly to invoke the
   web service and get the data, and had to tweak it a bit to make it happen. I will shortly update this post about the
  problem as well, describing GAEConnectionManager.

6) In order to deploy the developed application on the cloud, create an application identifier on App Engine.
    After creation you can see the application as follows:


    Update the application identifier in /war/WEB-INF/appengine-web.xml, placing it under the "application" tag.



7) Log in to your Google account from Eclipse (you can find it in the bottom left) and deploy the app on App
    Engine.




   And we are done! The app is available and hosted.
   You can access it from here: http://mongocloudapp.appspot.com/

    I would like to say that overall it was a very nice experience developing the app with GAE
    without losing my own preferred database.
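To make the insert in step 5 a bit more concrete, here is a small Python sketch of the shape of such a REST call. The application in this post uses Java with Apache HttpClient, so this is only an illustration of the request itself. It assumes that the collection URL is the one shown in the API view in step 3 and that the REST service accepts a JSON document via HTTP POST; the URL and document below are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request


def build_insert_request(collection_url, document):
    """Build (but do not send) an HTTP POST carrying one JSON document.

    `collection_url` is assumed to be the REST URL of the collection,
    including whatever API key parameter the service requires.
    """
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        collection_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL and document, for illustration only
request = build_insert_request(
    "https://example.invalid/collections/users?apiKey=YOUR_KEY",
    {"name": "test user"},
)
```

Sending it would be a matter of passing the request to urllib.request.urlopen; on GAE, outgoing HTTP calls like this go through the URL fetch service described earlier.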
