Sunday, August 17, 2014

Talend Big Data Integration with Hadoop


Hadoop can be downloaded from the Apache Hadoop website at hadoop.apache.org. The download includes the core modules: Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Related projects such as Hive, Pig, HBase and many more can be downloaded from their respective Apache websites. But installing and configuring all of this by hand is not easy.

Once this is done, the data integration itself is another challenge.
Writing every data-integration step directly as Pig, Hive or other scripts is a difficult route to get the required work done.

Instead of doing all of that by hand, I think an easier way is to use the sandbox provided by one of the Hadoop vendors together with the data integration platform from Talend. The good thing is that both the Hadoop installation and the Talend big data integration studio are free and available under the Apache license :-)

Some of the Hadoop vendors are HortonWorks, Cloudera, MapR, etc.


When I first tried HortonWorks with the Talend big data integration platform, it made things very easy.
Of course there is a paid license for the enterprise and cloud editions. But as I mentioned earlier, the versions used here are free and available under the Apache license.

So I thought of describing how this can be done. This post is an overview; later posts can go into detail on the individual features of Talend and HortonWorks.

So what do we need?
1) A virtual machine where Hadoop can be installed.
2) A Hadoop appliance to install on the virtual machine, i.e. the HortonWorks sandbox.
3) Talend Open Studio for Big Data.
4) A job that loads data into HDFS.

1) Virtual Machine:
  I have used Oracle VirtualBox from https://www.virtualbox.org. There was an issue with the latest 4.3 release, so I used 4.3.12 from this link: https://www.virtualbox.org/wiki/Download_Old_Builds_4_3
  Once you have finished installing VirtualBox, proceed to the next step.

2) Hadoop appliance to be installed on the virtual machine, i.e. HortonWorks sandbox:
  a) Download:
  HortonWorks provides sandbox images for VirtualBox, VMware and Hyper-V. As we have
   installed VirtualBox, download that one. It can be downloaded from the following link:
   http://hortonworks.com/products/hortonworks-sandbox/#install
 
   b) Setup:
   Start VirtualBox, go to the File menu and click on Import Appliance.
 
Fig: Importing of HortonWorks hadoop appliance-1.

      Select your appliance (.ova) file and click on Next. The Import Virtual Appliance wizard appears;
      select the appropriate memory and other parameters based on your needs.

Fig: Importing of HortonWorks hadoop appliance-2.

       Change the network settings so that the virtual machine gets its own IP address. By default the adapter is set to NAT; change it to Bridged Adapter.
Fig: Setting of the Network Adapter.
  c) Verify

  •       You can verify your installation and find the IP of the VM running Hadoop.
  •       Select the HortonWorks sandbox and click on the Start button. It takes some time to boot and start all the services (Hadoop, Pig, Hive, etc.) and then prompts for authentication.
  •       Username is "root" and password is "hadoop".
  •       At the command prompt type ifconfig (as it is Linux) and note down the IP address, which is used in Talend Studio when creating the HDFS connection.
  •       You can open that IP on port 8000 (the default) in a browser to check that the installation succeeded and the services are running.

         
You can browse the various services at : http://xxx.yyy.xxx.zzz:8000/
Fig: Accessing hadoop services using the browser from remote machine.



You can execute a script or browse older scripts from the Pig menu.
Fig: Accessing and running pig script from remote machine.



3) Talend Open Studio:
   You can download Talend Open Studio for Big Data (not the Data Integration edition) from the following link:
http://www.talend.com/download?qt-download_landing=0#quicktabs-download_landing

    You could instead download the Talend Big Data Sandbox for one of the Hadoop providers, but I wanted to keep Talend and Hadoop on separate machines, which is closer to the ideal case.
    On Windows the download is an .exe file; execute it, select the installation location, and the Talend Open Studio installation is done.

 4) Create a job for data load on HDFS:
     What we want to do here is create a job which generates a few records and writes them to HDFS on the Hadoop sandbox installed in step 2. The job is created and run on my local box, while Hadoop runs on a remote machine.

      At the end the job looks like the following, and executing it creates a file of 100 records in HDFS.
   
     a) Start talend by starting TOS_BD-win-x86_64.exe
     b) Create a new project and open it.
     c) Design a job: Following steps would make job designing complete.
     
 i) Create a new job by right-clicking on Repository->Job Design
                                         
Fig: Create a new job
ii) Create a new HDFSConnection to connect to HDFS. As this connects to the sandbox installed earlier, provide its IP and port (8020) as the NameNode URI.
Fig: Create a HDFS Connection and set the URI to the sandbox IP and Port (8020)

  iii) Create a tRowGenerator component, which generates rows based on the schema defined for it. For this example I have generated 100 rows. Other information about the schema is available in the images below.

Fig: tRowGenerator which would generate 100 rows data.


     Fig: Schema for the data generation

     iv) Link tHDFSConnection to tRowGenerator: Right-click on tHDFSConnection_1, choose Trigger -> On Subjob Ok and connect it to tRowGenerator_1. This starts the generator only after the connection has succeeded.
  v) Write data to HDFS using tHDFSOutput. Right-click on tRowGenerator_1 and drag the link to tHDFSOutput_1. If you need to see the generated output, you can add tLogRow_1 and connect it in a similar way.

Fig: tHDFSOutput object to write to HDFS

    Now our design is complete, save it.

     d) Run/execute the job to create the file in HDFS. You can either press F6 or go to the Run tab and click on the Run button. On successful execution you will see the file mygeneratedout.csv at the location provided in tHDFSOutput_1, which is "/user/hue/mygeneratedout.csv".

   You can click on the file to see the content.
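
For readers who prefer to see the idea in code, here is a minimal plain-Java sketch of what the job above does conceptually: build the NameNode URI used by the HDFS connection and generate 100 delimited rows. It is not the code Talend generates. The class name, the two-column schema (id and name), the ';' separator and the IP address are all assumptions made up for this illustration, and the sketch only prints what it would write instead of talking to HDFS.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch only; names, schema and separator are assumptions.
class RowGeneratorSketch {

    // Builds the NameNode URI that is entered in the HDFS connection,
    // e.g. hdfs://192.0.2.10:8020 (8020 is the port used in this post).
    static URI nameNodeUri(String sandboxIp, int port) {
        return URI.create("hdfs://" + sandboxIp + ":" + port);
    }

    // Generates `count` delimited rows, similar in spirit to tRowGenerator
    // feeding tHDFSOutput. The schema "id;name" is hypothetical.
    static List<String> generateRows(int count) {
        List<String> rows = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            rows.add(i + ";user" + i);
        }
        return rows;
    }

    public static void main(String[] args) {
        URI nameNode = nameNodeUri("192.0.2.10", 8020); // placeholder IP
        List<String> rows = generateRows(100);
        System.out.println("Would write " + rows.size() + " rows to "
                + nameNode + "/user/hue/mygeneratedout.csv");
    }
}
```

Replace the placeholder IP with the address noted from ifconfig on the sandbox.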

Thanks for reading the blog. :-)

I will soon write a few more posts on Hadoop and Talend which explain their specific features in more detail.




Sunday, August 10, 2014

RaspberryPi to run DC Motor using L298N Motor Controller

Raspberry Pi is an awesome and interesting device. I was totally astonished when I learned about it from one of my friends and colleagues, "Alex" at Misys, who keeps sharing many interesting things of this kind.

I studied computer applications and always had it in mind that I had missed out on engineering/robotics. As I have no knowledge of electronics and mechanics, it really looked difficult for me to try something in this area. One of the main reasons is the complexity involved in robotics in general (at least it was complex in my mind until I tried it).
Raspberry Pi has made it simple for users like me by abstracting much of that complexity away.

They say it is for kids to enjoy... I am still thinking, is it really only for kids? Anyway, I am enjoying this.

You can find more about the Raspberry Pi on their website. The most amazing thing for me is that this small credit-card-sized computer can be used to control devices. In this post I am not showing anything different from what others have done. In fact I have also referred to some of them.

I ordered the following from eBay to get the motor running:

Raspberry Pi                               - 1
DC motor                                   - 1
L298N Motor Controller                     - 1
Power supply (2*2)                         - 1
Male-Female and Female-Female connectors   - a few

Fig: RaspberryPI

Fig: DC Motor

Fig: L298N Motor Controller


The following video should give you some idea about the RPi and what it can do.




You can find the python program which was used to control the motor here:



Connectivity for the pins is explained below:


I am excited thinking about the various possible applications of the RPi.
I have already received the Raspberry Pi camera and am planning to make my own security system. :-)

This is the most amazing thing I have come across in recent times.


I faced a specific problem with HDMI connectivity to my monitor. My first setup was with a relatively new monitor (Dell) and it was fine. But when I switched to another monitor (LG), it did not work; there was no display. I struggled for a few days but finally found a way and posted the solution here:

http://www.raspberrypi.org/forums/viewtopic.php?f=28&t=5107&p=595514

Wednesday, August 28, 2013

Runtime instrumentation of bytecode using javaagent with Javassist

I was looking into some classloading issues in WebSphere Application Server and came across the -javaagent argument. I found it really interesting when I read "The agent class will be loaded by the same classloader which loads the class containing the application main method.", and I started digging more into it.

This was a new thing for me. I used a transformer to see which location a class was loaded from. Doing the same from the debugger for multiple classes was really time consuming. By using a transformer I could easily find out which classloader loaded a class and from which location, using ProtectionDomain, etc. Coincidentally I was also working on profiling and found it can be used there as well.

I wrote a small program around it and found it useful. Many people have already written in this area, but I could not quickly find a blog explaining how to get a count of method executions, so I thought of writing it myself. Of course many tools like Jensor, JRat, etc. provide a lot more, but I did it just to get method counts and not any other reporting.

So ultimately this blog post covers

  • Javaagent and premain
  • ClassFileTransformer
  • Javassist



  • Javaagent and premain: The -javaagent VM argument was introduced in JDK 5. An agent can augment Java bytecode dynamically through one or more transformers, which helps in profiling and bytecode manipulation. Being able to manipulate bytecode at runtime is one of the most useful features of Java. The agent can be added to the JVM arguments of your program or server, and it must implement a premain method.
         The premain method is executed after the JVM has initialized and before the main method of the
         application is executed. The agent is added as a JVM argument as follows:

        -javaagent:<path-to-agent-jar>, e.g. -javaagent:D:\Instrumentation\dsinstrumentation.jar





  • ClassFileTransformer: An agent can register one or more transformers, each of which implements the ClassFileTransformer interface. The transformer class is responsible for transforming the bytecode, and the transformation occurs before the class is defined by the JVM.

          The jar passed as the jarpath of the -javaagent argument must have a Premain-Class attribute in
          its MANIFEST.MF file, naming the class that contains premain.




  • Javassist: I have used Javassist for bytecode instrumentation. There are multiple choices to achieve the same; users can use Javassist, ASM, BCEL from Apache, etc.
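
As a rough illustration of the pieces listed above, here is a minimal sketch that uses only the JDK's java.lang.instrument package. The class names SketchAgent, LocationLoggingTransformer and MethodCounter are made up for this example and are not taken from dsinstrumentation.jar. The transformer only reports where a class was loaded from and leaves the bytecode unchanged; the counter shows one possible place where instrumented methods could record their calls. Treat it as a sketch under those assumptions, not as the implementation in the repository.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.CodeSource;
import java.security.ProtectionDomain;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Agent entry point (hypothetical name). The agent jar's MANIFEST.MF would
// need a line such as:
//   Premain-Class: SketchAgent
// Kept package-private so the sketch fits in one file; a real agent class
// would normally be public in its own file.
class SketchAgent {
    // Called by the JVM before the application's main method.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new LocationLoggingTransformer());
    }
}

// Reports the classloader and code location of every class that is loaded.
class LocationLoggingTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        System.out.println(describe(loader, className, protectionDomain));
        return null; // null means "leave the bytecode unchanged"
    }

    static String describe(ClassLoader loader, String className,
                           ProtectionDomain protectionDomain) {
        CodeSource source =
                (protectionDomain == null) ? null : protectionDomain.getCodeSource();
        Object location = (source == null) ? "unknown" : source.getLocation();
        return className + " loaded by " + loader + " from " + location;
    }
}

// Thread-safe call counter. A bytecode library could insert a call to
// MethodCounter.hit("Class.method") at the start of each method, for example
// through Javassist's CtMethod.insertBefore(...); that wiring is not shown.
class MethodCounter {
    private static final Map<String, AtomicLong> COUNTS = new ConcurrentHashMap<>();

    static void hit(String method) {
        COUNTS.computeIfAbsent(method, key -> new AtomicLong()).incrementAndGet();
    }

    static long count(String method) {
        AtomicLong counter = COUNTS.get(method);
        return (counter == null) ? 0 : counter.get();
    }
}
```

Returning null from transform is the documented way to tell the JVM that no transformation was performed, which makes this transformer safe to use purely for logging.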



You can find/download the source code from my GitHub repository, which also includes dsinstrumentation.jar so that it can be used directly.
The user needs to provide:
a) the -javaagent argument
b) a configuration system property pointing to a conf folder that contains the classinstrumentation.properties file,
which lists the class file names in the following format:



Download : git repository for instrumentation

To run dsinstrumentation.jar (which is available in the above mentioned download location) as a javaagent, give the following JVM arguments:

-javaagent:<FOLDER_PATH_TO_INSTRUMENTATION_JAR>\dsinstrumentation.jar
-Dconfiguration=<FOLDER_PATH_TO_CONF_FOLDER>\conf

Friday, January 25, 2013

Realization of dream of multitenancy with Google App Engine and MongoDB on cloud.

Multitenancy and the problem of managing the data

Multitenancy refers to a principle in software architecture where a single instance of the software runs on a server, serving multiple client organizations (tenants). Multitenancy is contrasted with a multi-instance architecture where separate software instances (or hardware systems) are set up for different client organizations. With a multitenant architecture, a software application is designed to virtually partition its data and configuration, and each client organization works with a customized virtual application instance. [WIKI]

What if I want to migrate my existing system to a cloud engine?
Moving all of the code/business logic to the app engine is no problem, but what if my database is already on another physical infrastructure (and is sharded, etc.)?


Of course App Engine (GAE) gives you scalability, security, high availability, etc., but on its own terms. You can only use the databases that the app engine supports,
which is one of the major drawbacks of GAE and, similarly, of other cloud engines.

I wanted to use MongoDB with Google App Engine (GAE) and could not see a way to do it, because we can't run MongoDB on GAE and are expected to use the Google datastore.

There are a few more issues because of GAE's highly restricted sandbox. As GAE's docs say, your application can only access other computers on the internet through the provided URL fetch and email services (and a few other ways). Other computers can only connect to the application by making HTTP/HTTPS requests on the standard ports. If you try to open a socket from your business logic, it is not allowed and GAE will raise an exception.

A few good things happened in this direction when Google announced, back in 2011, the Google Cloud SQL web service to support MySQL - https://developers.google.com/cloud-sql/
Some of the other cloud engines provide more options; for example Amazon RDS supports the MySQL, Oracle and Microsoft SQL Server database engines. This means that the code, applications, and tools you already use today with your existing databases can be used with Amazon RDS.

I was adamant about not moving away from MongoDB while still running my application on GAE, and mongolab became the savior. Why do I want my own choice of database? Because I already have my own hosted database servers and I want to manage the data on my own (security reasons, business reasons, etc.).

Here are the steps to use GAE with MongoDB when your MongoDB instance is hosted on a database cloud.

1) Sign up, log in and create a database on mongolab


This creates "mongodb" as the database.


2) Create a collection
    A collection (table) can be created by going to the database and clicking on "Add".

3) Open the collection, which is currently empty, and open the API view.
    The connection information at the top of the "Collections" tab is the URL which we are going to use to interact
    with the "mongodb" database created in step 1; mongolab provides REST-based methods for database
    interactions.



    At this point we are done with the database part.

    Let's move to GWT (Google Web Toolkit) to create a project.

4) Create a Web Application Project and make sure that "Use Google App Engine" is checked.


    Create a GWT based application. I have used almost the same application which I created earlier;
    you can refer to the steps for creating it in GWT RPC - Server communication with
    MongoDB

    The only change here is that MongoDB is hosted on a cloud, and because of that the server-side
    implementation is going to change.

5) Insert a record and retrieve records from the cloud database.
    A record is inserted into the database using Apache's HttpClient, which calls the REST-based web service
    in front of the MongoDB instance hosted on the cloud.



    Get all available users from the database.


   As I mentioned earlier about the sandbox security of GAE, I could not use HttpClient directly to invoke
   the web service and get the data, and had to tweak it a bit to make it work. I will shortly update this post
  with a description of the problem and of GAEConnectionManager.

6) In order to deploy the developed application on the cloud, create an application identifier on App Engine.
    After creation you can see the application as follows:


    Put the application identifier in /war/WEB-INF/appengine-web.xml, inside the "application" tag.



7) Log in to your Google account from Eclipse (you can find it in the bottom left) and deploy the app to App
    Engine.




   And we are done! The app is available and hosted.
   You can access it from here: http://mongocloudapp.appspot.com/

    Overall it was a very nice experience developing the app with GAE
    without losing my own preferred database.
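
Step 5 above can be sketched roughly as follows. This is not the code from the application: it uses java.net.HttpURLConnection from the JDK instead of Apache HttpClient, and the URL layout (/databases/<db>/collections/<collection>?apiKey=<key>) is an assumption modelled on the kind of URL shown in mongolab's API view, so copy the real URL from there. The class name, the field names (name, password) and the example.invalid host are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Sketch of talking to a REST front end for MongoDB over plain HTTP(S).
class MongoRestSketch {

    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assumed URL layout; take the real one from the provider's API view.
    static String collectionUrl(String baseUrl, String db,
                                String collection, String apiKey) {
        return baseUrl + "/databases/" + enc(db)
                + "/collections/" + enc(collection)
                + "?apiKey=" + enc(apiKey);
    }

    // Escapes backslashes and quotes only; control characters are not handled.
    private static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds a tiny JSON document with hypothetical field names.
    static String userJson(String name, String password) {
        return "{\"name\":\"" + escape(name)
                + "\",\"password\":\"" + escape(password) + "\"}";
    }

    // POSTs the JSON document and returns the HTTP status code.
    static int insert(String url, String json) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```

A call would look like MongoRestSketch.insert(MongoRestSketch.collectionUrl("https://example.invalid/api/1", "mongodb", "User", "<your-api-key>"), MongoRestSketch.userJson("alice", "secret")). Retrieving all users would be a GET request against the same collection URL.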

Saturday, November 17, 2012

SQLite database on Android platform


Android provides a nice way of storing data in a database through its built-in SQLite library. SQLite is a very lightweight database and is included in Android's library stack.
See the Android architecture here for more details.


SQLite is an open source database; to learn more about SQLite refer here.

This example demonstrates how SQLite can be used in an Android application.
I have used a few basic examples: create a database and table, insert a record, select a record from a table and delete a table.

This blog also covers the basic Android widgets, including TextView and buttons. If users want, they can use an external tool for GUI development (e.g. droiddraw); I have done most of the GUI development by modifying the XML directly or with the default GUI builder.

Create Database


Create Table


Insert record into table


Query SQLite table


Delete table and close database connection
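
The steps listed above can be sketched with the SQL kept in a plain Java class, so that it compiles without the Android SDK. The database name, table name and column names are hypothetical and not taken from the sample application. The comments show the Android calls that each statement would typically be passed to on a device.

```java
import java.util.Random;

// SQL for the create / insert / query / delete steps, as plain strings.
// On Android these would typically be used with
// android.database.sqlite.SQLiteDatabase, roughly like this:
//   SQLiteDatabase db = context.openOrCreateDatabase(DB_NAME, Context.MODE_PRIVATE, null);
//   db.execSQL(CREATE_TABLE);
//   db.execSQL(INSERT, new Object[] { "some thought" });
//   Cursor cursor = db.rawQuery(SELECT_BY_ID, new String[] { "3" });
//   db.execSQL(DROP_TABLE);
//   db.close();
class ThoughtSql {
    static final String DB_NAME = "thoughts.db"; // hypothetical name

    static final String CREATE_TABLE =
            "CREATE TABLE IF NOT EXISTS thought ("
            + "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            + "message TEXT NOT NULL)";

    static final String INSERT = "INSERT INTO thought (message) VALUES (?)";

    static final String SELECT_BY_ID = "SELECT message FROM thought WHERE id = ?";

    static final String DROP_TABLE = "DROP TABLE IF EXISTS thought";

    // Picks a row id between 1 and rowCount (inclusive) to show a random thought.
    static int randomId(int rowCount, Random random) {
        return 1 + random.nextInt(rowCount);
    }
}
```

Using "?" placeholders with bind arguments, instead of concatenating values into the SQL string, avoids quoting problems when a thought contains an apostrophe.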




This is a simple Android app which shows a thought of the day to the user. If the user wants to see more thoughts, the user can press Next; otherwise the user can press the Thanks button. The purpose of this post is just to show the capability of SQLite, so there are only 20 records in the table and they are shown in random order.





You can either checkout or download this sample application code from here.

Monday, August 6, 2012

GWT Editable Table - CellTable with remove row


GWT offers a table widget where we can specify column types and make the row data modifiable.

CellTable supports paging and columns. I have not used paging in this example, only specified columns for text and a button.




The Column class defines the Cell used to render a column.

Implement Column.getValue(Object) to retrieve the field value from the row object that will be rendered in the Cell.



and the data which will be rendered:

You can use TextCell instead of EditTextCell, in case you do not want to make column editable.


A DataProvider such as ListDataProvider displays data which is provided as a list.

I have used the model class User to create a few items in a list and passed that list to the ListDataProvider, which works as the model for the CellTable.



Remove row from the dataProvider and table.


When the user clicks on "x", update gets called on the updater, where we can remove the row from the model, refresh the model and redraw the table (I found it also worked fine without the refresh and redraw calls).

Example files are available to download here.
Look for GwtExamples.rar (v.1)

Tuesday, July 24, 2012

GWT RPC - Server communication with MongoDB

Further to the client tutorial in the previous blog, here I am going to present how the client communicates with the server with the help of GWT RPC (Remote Procedure Call).
Because RPC works asynchronously, specific parts of the client can be updated without refreshing the complete client.














Pre-Requisite:
Java, Eclipse with GWT plugin, MongoDB.



In this example I have extended the previous client to
  a) save data, i.e. user data, to the database and
  b) authenticate a user

The client can send/receive serializable objects to/from the server. In this example the server uses that data and inserts it into the DB, which is MongoDB in our case.


Create Model
The client can send serializable Java objects over HTTP, so let's create a model first.


User.java
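
A model for GWT RPC could look roughly like the sketch below. The field names (name, password) are assumptions for illustration and may differ from the User.java in the downloadable example. In a GWT project the class would be public and live in a package that is compiled for the client; it is kept package-private here only so that the sketch stands alone.

```java
import java.io.Serializable;

// Minimal serializable model. GWT RPC needs a no-argument constructor so
// that it can recreate the object on the other side of the wire.
class User implements Serializable {
    private static final long serialVersionUID = 1L;

    private String name;      // hypothetical field
    private String password;  // hypothetical field

    User() {
        // required for serialization
    }

    User(String name, String password) {
        this.name = name;
        this.password = password;
    }

    String getName() {
        return name;
    }

    void setName(String name) {
        this.name = name;
    }

    String getPassword() {
        return password;
    }

    void setPassword(String password) {
        this.password = password;
    }
}
```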



Create Service
Now we need a service which the client can use.
In our case we name it MongoDBService, and it extends RemoteService.


MongoDBService.java



@RemoteServiceRelativePath("dbservice") is the annotation used to identify and call the service; its value should match the service path defined in the servlet-mapping tag of web.xml.

We need to create an asynchronous interface for our MongoDBService; the client actually makes RPC calls through MongoDBServiceAsync. Every method of MongoDBService gets an extra parameter of type AsyncCallback in the async interface. The client is notified through this callback when an asynchronous service call completes.

MongoDBServiceAsync.java
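
To make the relationship between a service and its async twin concrete without pulling in the GWT libraries, here is a plain-Java sketch of the pattern. Callback is a stand-in for GWT's AsyncCallback, which has the same two methods; AuthService, AuthServiceAsync, InMemoryAuthService and LocalAsyncProxy are hypothetical names that play the roles of MongoDBService, MongoDBServiceAsync, the servlet and the proxy GWT generates. Unlike real GWT RPC, this stand-in invokes the callback immediately on the calling thread.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for com.google.gwt.user.client.rpc.AsyncCallback<T>.
interface Callback<T> {
    void onFailure(Throwable caught);
    void onSuccess(T result);
}

// Synchronous contract, implemented on the server side.
interface AuthService {
    boolean authenticate(String name, String password);
}

// Async twin: same method name and parameters, void return type,
// plus a trailing callback that receives the result.
interface AuthServiceAsync {
    void authenticate(String name, String password, Callback<Boolean> callback);
}

// In-memory implementation standing in for the servlet and the database.
class InMemoryAuthService implements AuthService {
    private final Map<String, String> passwords = new HashMap<>();

    void addUser(String name, String password) {
        passwords.put(name, password);
    }

    @Override
    public boolean authenticate(String name, String password) {
        return password != null && password.equals(passwords.get(name));
    }
}

// Plays the role of the generated proxy: runs the call and reports the
// outcome through the callback instead of a return value.
class LocalAsyncProxy implements AuthServiceAsync {
    private final AuthService delegate;

    LocalAsyncProxy(AuthService delegate) {
        this.delegate = delegate;
    }

    @Override
    public void authenticate(String name, String password, Callback<Boolean> callback) {
        boolean ok;
        try {
            ok = delegate.authenticate(name, password);
        } catch (RuntimeException e) {
            callback.onFailure(e);
            return;
        }
        callback.onSuccess(ok);
    }
}

class RpcSketchDemo {
    public static void main(String[] args) {
        InMemoryAuthService service = new InMemoryAuthService();
        service.addUser("alice", "secret");
        AuthServiceAsync async = new LocalAsyncProxy(service);
        async.authenticate("alice", "secret", new Callback<Boolean>() {
            @Override
            public void onFailure(Throwable caught) {
                System.out.println("call failed: " + caught);
            }

            @Override
            public void onSuccess(Boolean result) {
                System.out.println("authenticated: " + result);
            }
        });
    }
}
```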



Implement service at the server side
Create class MongoDBServiceImpl which extends RemoteServiceServlet and implements MongoDBService. Basically MongoDBServiceImpl is a servlet which is extending from RemoteServiceServlet rather than the HttpServlet directly.



Updating web.xml with servlet details
Make sure that every service declared with the RemoteServiceRelativePath annotation has a matching servlet and servlet-mapping entry.

web.xml



The url-pattern path should be formed as /module/service.
In this case the module name is gwtwebappdemo (see the rename-to value in GwtWebAppDemo.gwt.xml) and the service is dbservice, which gives /gwtwebappdemo/dbservice.


Client Service Call
calling service from client





Running application

1) Start MongoDB
    In this case I have started the DB without authentication and on the default port.
    The default port is 27017, which we have used to connect to the database in the code. See DBUtil.java in the
   example.



2) Add user into database.
    The client makes an asynchronous call to the server, and the server starts processing. Once request processing is finished, the server calls back through the callback handler provided in the request. The client is called in onFailure in case of failure and in onSuccess if the request was processed successfully.


    On the server side, the server receives the User object and stores the data in the User collection of the mymongodb database using BasicDBObject.

MongoDBServiceImpl.java


3) Check the creation of the collection (table) in the dbs (database)
    In this example mymongodb is the database and User is the collection which we are using.
    Initially neither the database nor the collection is present; both get created automatically
    when the first save happens.



4) Check User Authentication with wrong input.



Download complete example from here in GwtWebAppDemo_server.rar file.
