Your First Real Time Big Data Analytics Application
Summary: How to use XAP for Real-time analysis of Big Data
Introduction
The Challenge

Twitter users aren't just interested in reading the tweets of the people they follow; they are also interested in finding new people and topics to follow based on popularity. This poses several challenges to the Twitter architecture due to the vast volume of tweets. In this example, we focus on the challenges relating to the word count use case. The challenge itself is straightforward: count, in real time, how frequently each word appears across the stream of incoming tweets.
Doing this at Twitter scale is not simple, because the sheer volume of data has knock-on effects on every stage of the processing: tweets must be ingested, parsed, filtered, counted, and persisted continuously, without the system falling behind the stream.
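To make the core computation concrete, here is a minimal sketch of the word count logic in plain Java. It is illustrative only — the class and method names are ours, not the example project's — and it deliberately ignores the distribution, filtering, and persistence concerns that the rest of this page addresses.

import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {

    // Count how often each token appears in a single tweet's text.
    static Map<String, Integer> countTokens(String tweetText) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : tweetText.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Integer current = counts.get(token);
            counts.put(token, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countTokens("to be or not to be"));
        // prints something like {not=1, be=2, or=1, to=2}
    }
}

The hard part, as described above, is doing this under real-world load, which is why the actual application partitions the counting across a cluster.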
Solution Architecture
Implementing the Solution as a XAP Application
Building the Application

The following are step-by-step instructions for building the application:

1. Download and install XAP.

2. Get the application: download the application and unzip it under the <XapInstallationRoot>/recipes/apps folder. A streaming-bigdata folder is created. If you already have a streaming-bigdata folder, remove it first and replace it with the contents of the downloaded zip file.

3. Install Maven and the GigaSpaces Maven plug-in.
4. Build the application. Make sure the gsVersion property in the project's pom.xml matches your installed XAP version:

<properties>
    <gsVersion>9.1.2-RELEASE</gsVersion>
</properties>

To build the project, type the following at your command prompt (Windows) or shell (*nix):

mvn package
If you get a "No gslicense.xml license file was found in current directory" error, run the following instead:

mvn package -DargLine="-Dcom.gs.home=<XapInstallationRoot>"

where <XapInstallationRoot> is the XAP root folder, for example:

mvn package -DargLine="-Dcom.gs.home=c:\gigaspaces-xap-premium-9.0.0-ga"

The Maven build will download the required dependencies, compile the source files, run the unit tests, and build the required jar files. In our example, the following processing unit jar files are built: rt-processor-XAP-9.1.jar (the processor) and rt-feeder-XAP-9.1.jar (the feeder).
Once the build is complete, a summary message similar to the following is displayed:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] rt-analytics ...................................... SUCCESS [0.001s]
[INFO] rt-common ......................................... SUCCESS [2.196s]
[INFO] rt-processor ...................................... SUCCESS [11.301s]
[INFO] rt-feeder ......................................... SUCCESS [3.102s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 16.768s
[INFO] Finished at: Sun May 13 13:38:06 IDT 2012
[INFO] Final Memory: 14M/81M
[INFO] ------------------------------------------------------------------------

Running and Debugging the Application within an IDE

Since the application is a Maven project, you can load it into your Java IDE, which automatically configures the modules and their classpaths.
Once the project is loaded in your IDE, you can run the application as follows:

- rt-processor run configuration: create a run configuration whose main class is org.openspaces.pu.container.integrated.IntegratedProcessingUnitContainer, with the rt-processor module on the classpath.
- rt-feeder run configuration: create a matching configuration for the rt-feeder module, with the same main class.
For more information about the IntegratedProcessingUnitContainer class (which runs processing units within your IDE), see Running and Debugging Within Your IDE.
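If you prefer launching from code rather than an IDE run configuration, the container can also be started from a plain main method. This is a minimal sketch; the cluster arguments shown are an assumption that mirrors the two-partition, one-backup deployment used later on this page.

import org.openspaces.pu.container.integrated.IntegratedProcessingUnitContainer;

public class RunProcessorInIDE {
    public static void main(String[] args) throws Exception {
        // Starts the rt-processor processing unit inside the IDE process.
        // Run this with the rt-processor module on the classpath.
        IntegratedProcessingUnitContainer.main(new String[] {
            "-cluster", "schema=partitioned-sync2backup", "total_members=2,1"
        });
    }
}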
To run the application, run the processor configuration, and then the feeder configuration. An output similar to the following is displayed:

2013-02-22 13:09:38,524 INFO [org.openspaces.bigdata.processor.TokenFilter] - filtering tweet 305016632265297920
2013-02-22 13:09:38,526 INFO [org.openspaces.bigdata.processor.FileArchiveOperationHandler] - Writing 1 object(s) to File
2013-02-22 13:09:38,534 INFO [org.openspaces.bigdata.processor.TweetArchiveFilter] - Archived tweet 305016632265297920
2013-02-22 13:09:38,535 INFO [org.openspaces.bigdata.processor.LocalTokenCounter] - local counting of a bulk of 1 tweets
2013-02-22 13:09:38,537 INFO [org.openspaces.bigdata.processor.LocalTokenCounter] - writing 12 TokenCounters across the cluster
2013-02-22 13:09:38,558 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token arrive by 1
2013-02-22 13:09:38,606 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token Reine by 1
2013-02-22 13:09:38,622 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token pute by 1
2013-02-22 13:09:38,624 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token lycée by 2
2013-02-22 13:09:41,432 INFO [org.openspaces.bigdata.processor.TweetParser] - parsing tweet SpaceDocument .....
2013-02-22 13:09:41,440 INFO [org.openspaces.bigdata.processor.TokenFilter] - filtering tweet 305016630734381057
2013-02-22 13:09:41,441 INFO [org.openspaces.bigdata.processor.FileArchiveOperationHandler] - Writing 1 object(s) to File
2013-02-22 13:09:41,447 INFO [org.openspaces.bigdata.processor.LocalTokenCounter] - local counting of a bulk of 1 tweets
2013-02-22 13:09:41,448 INFO [org.openspaces.bigdata.processor.LocalTokenCounter] - writing 11 TokenCounters across the cluster
2013-02-22 13:09:41,454 INFO [org.openspaces.bigdata.processor.TweetArchiveFilter] - Archived tweet 305016630734381057
2013-02-22 13:09:41,463 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token Accounts by 1
2013-02-22 13:09:41,485 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token job by 1
2013-02-22 13:09:41,487 INFO [org.openspaces.bigdata.processor.GlobalTokenCounter] - Increment local token time by 1

Switching between the Online Feeder and the Test Feeder

For testing purposes, you can switch between the online feeder (TwitterHomeTimelineFeederTask) and the test feeder (ListBasedFeederTask). The former uses real-time Twitter timeline data, while the latter uses simulated tweet data. By default, TwitterHomeTimelineFeederTask is enabled. To switch to ListBasedFeederTask, comment out the @Component line at the top of the TwitterHomeTimelineFeederTask source file, uncomment the same line in the ListBasedFeederTask source file, and then rebuild the project using the following command:

mvn package
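The toggle works because the feeders are Spring beans picked up by classpath scanning: only classes annotated with @Component are instantiated. The sketch below illustrates the idea with empty class bodies; the real classes in the rt-feeder module carry the actual feeding logic.

import org.springframework.stereotype.Component;

// Online feeder disabled: with @Component commented out, component
// scanning no longer registers this class as a bean.
//@Component
class TwitterHomeTimelineFeederTask {
    // polls the Twitter home timeline (body elided in this sketch)
}

// Test feeder enabled: uncommenting @Component makes this class a bean.
@Component
class ListBasedFeederTask {
    // feeds simulated tweets from a fixed list (body elided in this sketch)
}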
Running the Application with the XAP Runtime Environment

The following are step-by-step instructions for running the application in the XAP runtime environment. After starting a local XAP runtime (for example, by running the gs-agent script), deploy the processor:
Unix

./gs.sh deploy <applicationRoot>/processor/target/rt-processor-XAP-9.x.jar
Windows

gs deploy ..\recipes\apps\streaming-bigdata\processor\target\rt-processor-XAP-9.1.jar

You should see the following output:

Deploying [rt-processor-XAP-9.1.jar] with name [rt-processor-XAP-9.1] under groups [gigaspaces-9.1.2-XAPPremium-ga] and locators []
Uploading [rt-processor-XAP-9.1] to [http://127.0.0.1:61765/]
Waiting indefinitely for [4] processing unit instances to be deployed...
[rt-processor-XAP-9.1.2] [1] deployed successfully on [127.0.0.1]
[rt-processor-XAP-9.1.2] [1] deployed successfully on [127.0.0.1]
[rt-processor-XAP-9.1.2] [2] deployed successfully on [127.0.0.1]
[rt-processor-XAP-9.1.2] [2] deployed successfully on [127.0.0.1]
Finished deploying [4] processing unit instances

Next, deploy the feeder:
Unix
./gs.sh deploy <applicationRoot>/feeder/target/rt-feeder-XAP-9.x.jar

Windows

gs deploy ..\recipes\apps\streaming-bigdata\feeder\target\rt-feeder-XAP-9.1.jar
You should see the following output:

Deploying [rt-feeder-XAP-9.1.jar] with name [rt-feeder-XAP-9.1] under groups [gigaspaces-9.1.2-XAPPremium-ga] and locators []
Uploading [rt-feeder-XAP-9.1] to [http://127.0.0.1:61765/]
SLA Not Found in PU. Using Default SLA.
Waiting indefinitely for [1] processing unit instances to be deployed...
[rt-feeder-XAP-9.1] [1] deployed successfully on [127.0.0.1]
Finished deploying [1] processing unit instances

Once the application is running, you can use the XAP UI tools to view the application, access the data and the counters, and manage the application.
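As an alternative to the gs deploy commands, the same deployment can be scripted with the XAP Admin API. The following is a minimal sketch, assuming it runs from the streaming-bigdata folder and that a GSM is already up; the paths and the waiting strategy are ours, not part of the example project.

import java.io.File;

import org.openspaces.admin.Admin;
import org.openspaces.admin.AdminFactory;
import org.openspaces.admin.gsm.GridServiceManager;
import org.openspaces.admin.pu.ProcessingUnit;
import org.openspaces.admin.pu.ProcessingUnitDeployment;

public class DeployApp {
    public static void main(String[] args) {
        Admin admin = new AdminFactory().createAdmin();
        // Wait for a Grid Service Manager to become available.
        GridServiceManager gsm = admin.getGridServiceManagers().waitForAtLeastOne();

        // Deploy the processor first, and wait until all of its
        // instances (primaries and backups) are up.
        ProcessingUnit processor = gsm.deploy(new ProcessingUnitDeployment(
                new File("processor/target/rt-processor-XAP-9.1.jar")));
        processor.waitFor(processor.getTotalNumberOfInstances());

        // Then deploy the feeder.
        gsm.deploy(new ProcessingUnitDeployment(
                new File("feeder/target/rt-feeder-XAP-9.1.jar")));

        admin.close();
    }
}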
Viewing the Most Popular Words on Twitter

To view the most popular words on Twitter, start the GS-UI using the gs-ui script (gs-ui.bat or gs-ui.sh), click the Query icon, and execute the following SQL query:

select uid,* from org.openspaces.bigdata.common.counters.GlobalCounter order by counter DESC

You should see the most popular words on Twitter, ordered by popularity. You can re-execute the query at any time to refresh the results.
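The same counters can also be read programmatically from the space. The following is a minimal sketch, assuming the processor's space is named "space" and that the GlobalCounter class from the rt-common module is on the classpath; the lookup URL and the threshold are illustrative only.

import org.openspaces.bigdata.common.counters.GlobalCounter;
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

import com.j_spaces.core.client.SQLQuery;

public class TopWordsReader {
    public static void main(String[] args) {
        // Connect to the deployed processor's space (name assumed).
        GigaSpace gigaSpace = new GigaSpaceConfigurer(
                new UrlSpaceConfigurer("jini://*/*/space").space()).gigaSpace();

        // Read counters above an arbitrary threshold.
        SQLQuery<GlobalCounter> query = new SQLQuery<GlobalCounter>(
                GlobalCounter.class, "counter > 100");
        for (GlobalCounter counter : gigaSpace.readMultiple(query)) {
            System.out.println(counter);
        }
    }
}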
Persisting to Cassandra

Once raw tweets are processed, they are moved from the space to the historical-data back-end store. By default, this points to a simple flat-file store implemented by FileArchiveOperationHandler. The example application also includes a Cassandra back end, CassandraArchiveHandler.

The following are step-by-step instructions for configuring the application to persist to Cassandra:

1. In the processor's configuration, enable the Cassandra archive handler and wire it into the archive container:

<os-archive:cassandra-archive-handler id="cassandraArchiveHandler"
    giga-space="gigaSpace"
    hosts="localhost"
    port="9160"
    keyspace="TWITTER"
    write-consistency="QUORUM"/>

<os-archive:archive-container id="archiveContainer"
    giga-space="gigaSpace"
    archive-handler="cassandraArchiveHandler"
    concurrent-consumers="${archiver.threads}"
    max-concurrent-consumers="${archiver.threads}"
    batch-size="100">
    <os-archive:tx-support tx-manager="transactionManager"/>
    <os-core:template ref="archiverFilter"/>
    <os-archive:exception-handler ref="archiverFilter"/>
</os-archive:archive-container>

Make sure the cassandra-discovery and fileArchiveHandler beans are commented out and not instantiated.

2. Download, install, and start the Cassandra database. For more information, see Cassandra's Getting Started page.

3. Create the schema by running the supplied script against your Cassandra instance:

<cassandra home>/bin/cassandra-cli --host <cassandra host name> --file <project home>/processor/cassandra-schema.txt

4. Build and deploy the application as described in the previous sections. You will need to undeploy the existing processor and feeder before you deploy the new versions.

You can view the data within Cassandra using the Tweet column family. Move to the Cassandra bin folder and run the cassandra-cli command:

>cassandra-cli.bat
[default@TWITTER] connect localhost/9160;
[default@TWITTER] use TWITTER;
[default@TWITTER] list Tweet;
-------------------
RowKey: 0439486840025000
=> (column=Archived, value=00000000, timestamp=1361398666863002)
=> (column=CreatedAt, value=0000013cf9aea1c8, timestamp=1361398666863004)
=> (column=FromUserId, value=0000000039137bb7, timestamp=1361398666863003)
=> (column=Processed, value=01, timestamp=1361398666863001)
=> (column=Text, value=405f5f4c6f7665526562656c64652073656775696e646f2021205573613f20234172746875724d65754d61696, timestamp=1361398666863000)
...
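Both the flat-file and the Cassandra back ends plug into the archive container through the same ArchiveOperationHandler contract, so any other store can be wired in the same way. Below is a minimal sketch of a custom handler, assuming a trivial implementation that just logs the archived objects; it is not part of the example project.

import org.openspaces.archive.ArchiveOperationHandler;

// Illustrative only: logs each archived object instead of persisting it.
// Declare it as a bean and reference it from the archive container's
// archive-handler attribute, as shown in the XML above.
public class ConsoleArchiveOperationHandler implements ArchiveOperationHandler {

    @Override
    public void archive(Object... objects) {
        for (Object object : objects) {
            System.out.println("archiving: " + object);
        }
    }

    @Override
    public boolean supportsBatchArchiving() {
        // true means the archive container may hand a whole batch of
        // objects to archive(...) in a single call.
        return true;
    }
}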
Running the Example using Cloudify

To run the application together with the Cassandra database as a single application on any cloud, we use Cloudify. A key concept in Cloudify is deploying and managing the entire application life cycle using a recipe. This approach provides total application life-cycle automation without any code or architecture changes. Furthermore, it is cloud neutral, so you don't get locked into a specific cloud vendor. The following snippet shows the application's recipe:

application {
    name="big_data_app"

    service {
        name = "feeder"
        dependsOn = ["processor"]
    }
    service {
        name = "processor"
        dependsOn = ["cassandra"]
    }
    service {
        name = "cassandra"
    }
}

The following snippet shows the life-cycle events described in the Cassandra service recipe:

service {
    name "rt_cassandra"
    icon "Apache-cassandra-icon.png"
    numInstances 1
    type "NOSQL_DB"
    lifecycle {
        init "cassandra_install.groovy"
        preStart "cassandra_prestart.groovy"
        start "cassandra_start.groovy"
        postStart "cassandra_poststart.groovy"
    }
    ...
}

The following snippet shows the processing unit described in the processor recipe:

service {
    name "processor"
    numInstances 4
    maxAllowedInstances 4
    statefulProcessingUnit {
        binaries "rt-analytics-processor.jar"
        sla {
            memoryCapacity 512
            maxMemoryCapacity 512
            highlyAvailable true
            memoryCapacityPerContainer 128
        }
    }
}

The application recipe is packaged as a folder containing the application descriptor together with a sub-folder for each service recipe.

Testing the Application on a Local Cloud

XAP comes with a cloud emulator called localcloud, which allows you to test recipe execution on your local machine before moving to a real cloud. Follow the step-by-step instructions to install and run the application on the localcloud. For more information, see the Deploying Applications page.

Running on Clouds

To run the application on one of the supported clouds, bootstrap the target cloud and install the application in the same way as on the localcloud.