Distribute with Hazelcast, Persist into HBase
In this article I will implement a solution for a Big Data scenario.
I will use HBase as the persistence layer and Hazelcast as the distributed cache.
The resulting project will be a “Getting Started” sample for anyone who wants to use HBase as persistent storage for their Hazelcast application.
Suppose you have (or hope to have :) ) “User” data with billions of records. -> Big Data
People will access the data from your web application; query it, search it… -> Real-time Access
Some records will be accessed more frequently. -> Cache them in memory, serve them faster.
You can add/remove columns, there is no strict schema. -> Sparse data
Given these requirements, the “NoSQL + Distributed Cache” solution fits our scenario.
I will persist the user data to HBase:
A NoSQL key-value datastore built on Hadoop and specialized for Big Data requirements.
It is modeled after Google’s Bigtable and used by Yahoo and Facebook.
Facebook preferred HBase over Cassandra for its messaging system.
To learn more:
I will cache and distribute the data with Hazelcast.
HBase is intended to be used in a cluster, but it has a standalone mode that you can use for development purposes.
For HBase setup follow:
If you use Ubuntu, you will likely encounter problems.
Although Windows is not recommended for production, you can still try HBase on Windows.
Hazelcast is dead simple to use: just download it and add hazelcast.jar to your classpath.
If you are new to Hazelcast, have a look at:
Create a Maven Java project with the following dependencies:
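Something like this should do; the version numbers below are only examples, pick the ones matching your installation (note that newer HBase releases ship the client as a separate hbase-client artifact instead of the monolithic hbase one):

<dependencies>
    <dependency>
        <groupId>com.hazelcast</groupId>
        <artifactId>hazelcast</artifactId>
        <version>2.0.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase</artifactId>
        <version>0.92.1</version>
    </dependency>
</dependencies>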
Create a User POJO:
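A minimal sketch, with fields matching the columns used later in this article (the qualifiers under ‘cf_basic’ and ‘cf_text’):

import java.io.Serializable;

public class User implements Serializable {

    private static final long serialVersionUID = 1L;

    private String id;       // row key, e.g. "u-6"
    private String name;     // cf_basic:name
    private int age;         // cf_basic:age
    private String location; // cf_basic:location
    private String details;  // cf_text:details

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public String getLocation() { return location; }
    public void setLocation(String location) { this.location = location; }
    public String getDetails() { return details; }
    public void setDetails(String details) { this.details = details; }
}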
Create the user table in HBase:
Run HBase by,
HBASE_DIR> ./bin/start-hbase.sh
At this point it is good to check the logs, to be sure everything is installed and started properly.
Then open the HBase shell by,
HBASE_DIR> ./bin/hbase shell
Create the user table:
hbase(main):008:0> create 'user', 'cf_basic', 'cf_text'
Here I should say more about ‘cf_basic’ and ‘cf_text’. These are column families.
Columns in the same column family are stored together on disk with the same storage specifications.
For example, if you want some type of data (e.g. images) to be compressed, put those columns in the same column family so you can define the same storage rule for them.
Here we have two column families: ‘cf_basic’ is for simple types (numbers, strings) and ‘cf_text’ is for long text columns.
Notice that we have done nothing about schema, column types etc.
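For example, you can write a brand-new column qualifier at any time without altering the table (the row key and values here are made up):

hbase(main):009:0> put 'user', 'u-1', 'cf_basic:name', 'John'
hbase(main):010:0> put 'user', 'u-1', 'cf_basic:nickname', 'johnny'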
In the HBase intro video, you will recall Todd uses the term “datastore” instead of “database” when defining HBase.
HBase (and other key-value stores) is more like a persisted HashMap than a database.
You gain scalability but lose complex queries.
This is the class that Hazelcast will call on each map operation.
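Below is a minimal sketch of such a class. The method signatures come from Hazelcast's MapStore interface and the column layout follows the ‘user’ table created above; the UserMapStore name and the HBaseService helper (shown next) are assumptions of this sketch, and the Put.add / Result.getValue calls follow the older HBase client API of this era:

import java.io.IOException;
import java.util.Collection;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

import com.hazelcast.core.MapStore;

public class UserMapStore implements MapStore<String, User> {

    private static final byte[] CF_BASIC = Bytes.toBytes("cf_basic");
    private static final byte[] CF_TEXT = Bytes.toBytes("cf_text");

    @Override
    public void store(String key, User user) {
        // one Put per user; the row key is the map key, e.g. "u-6"
        Put put = new Put(Bytes.toBytes(key));
        put.add(CF_BASIC, Bytes.toBytes("name"), Bytes.toBytes(user.getName()));
        put.add(CF_BASIC, Bytes.toBytes("age"), Bytes.toBytes(user.getAge()));
        put.add(CF_BASIC, Bytes.toBytes("location"), Bytes.toBytes(user.getLocation()));
        put.add(CF_TEXT, Bytes.toBytes("details"), Bytes.toBytes(user.getDetails()));
        try {
            HBaseService.getInstance().put(put);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void storeAll(Map<String, User> map) {
        for (Map.Entry<String, User> entry : map.entrySet()) {
            store(entry.getKey(), entry.getValue());
        }
    }

    @Override
    public void delete(String key) {
        try {
            HBaseService.getInstance().delete(new Delete(Bytes.toBytes(key)));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void deleteAll(Collection<String> keys) {
        for (String key : keys) {
            delete(key);
        }
    }

    @Override
    public User load(String key) {
        try {
            Result result = HBaseService.getInstance().get(new Get(Bytes.toBytes(key)));
            if (result.isEmpty()) {
                return null; // not in HBase either
            }
            User user = new User();
            user.setId(key);
            user.setName(Bytes.toString(result.getValue(CF_BASIC, Bytes.toBytes("name"))));
            user.setAge(Bytes.toInt(result.getValue(CF_BASIC, Bytes.toBytes("age"))));
            user.setLocation(Bytes.toString(result.getValue(CF_BASIC, Bytes.toBytes("location"))));
            user.setDetails(Bytes.toString(result.getValue(CF_TEXT, Bytes.toBytes("details"))));
            return user;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // intentionally no initial load; returning null tells Hazelcast
    // not to pre-populate the map (see the note on loadAll below)
    @Override
    public Map<String, User> loadAll(Collection<String> keys) {
        return null;
    }

    @Override
    public Set<String> loadAllKeys() {
        return null;
    }
}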
And here is a singleton service for getting the HBase table.
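A sketch of that service (the HBaseService name is assumed here): it creates the HTable once and synchronizes access to it, which is one simple way to deal with the thread-safety issue mentioned below.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;

public class HBaseService {

    private static final HBaseService INSTANCE = new HBaseService();

    private final HTable table;

    private HBaseService() {
        try {
            // reads hbase-site.xml from the classpath, if present
            Configuration conf = HBaseConfiguration.create();
            table = new HTable(conf, "user");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static HBaseService getInstance() {
        return INSTANCE;
    }

    // HTable is not thread-safe, so access is synchronized;
    // a pool of HTable instances would scale better
    public synchronized void put(Put put) throws IOException {
        table.put(put);
    }

    public synchronized Result get(Get get) throws IOException {
        return table.get(get);
    }

    public synchronized void delete(Delete delete) throws IOException {
        table.delete(delete);
    }
}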
On map.get, Hazelcast will look in HBase if it cannot find the key in memory. Similarly, when you put an entry into the map, Hazelcast will persist it to HBase.
Why haven't we implemented loadAll? The loadAll and loadAllKeys methods are for initially filling the Hazelcast map from the database. Since we expect millions of records, it is not feasible to load the whole database into memory, so we left them empty.
Unfortunately HTable is not thread-safe, so you have to handle concurrency yourself.
Here is the hazelcast.xml that we put on the classpath.
The first difference from the default one is that I have added a map-store declaration to the map config section.
Secondly, I have enabled eviction on the map. You can use Hazelcast as a distributed cache by enabling eviction, so that Hazelcast evicts (removes) expired entries. To enable eviction, set eviction-policy to LRU (or LFU) and set max-size, as sketched below. For more information about Hazelcast eviction see:
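Putting the two together, the map section might look like this (a sketch following the Hazelcast 2.x configuration schema; 10000 is an arbitrary max-size):

<map name="user">
    <map-store enabled="true">
        <!-- fully-qualified class name of your MapStore implementation -->
        <class-name>UserMapStore</class-name>
        <!-- 0 = write-through; a positive value switches to write-behind -->
        <write-delay-seconds>0</write-delay-seconds>
    </map-store>
    <!-- evict least-recently-used entries once the map holds 10000 entries per node -->
    <eviction-policy>LRU</eviction-policy>
    <max-size>10000</max-size>
</map>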
Run The Code
Now let’s test it.
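A minimal test driver might look like this (the Main class name is assumed, and the sample values mirror the shell output below; newHazelcastInstance(null) loads the hazelcast.xml found on the classpath):

import java.util.Map;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class Main {

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(null);

        Map<String, User> users = hz.getMap("user");

        User user = new User();
        user.setId("u-6");
        user.setName("Mehmet Dogan");
        user.setAge(29);
        user.setLocation("Istanbul");
        user.setDetails("software developer ...");

        users.put(user.getId(), user); // persisted to HBase through the MapStore
        System.out.println(users.get("u-6").getName()); // served from memory
    }
}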
And see the records in the database:
hbase(main):055:0> get 'user', 'u-6'
cf_basic:age timestamp=1334320415281, value=\x00\x00\x00\x1D
cf_basic:location timestamp=1334320415281, value=Istanbul
cf_basic:name timestamp=1334320415281, value=Mehmet Dogan
cf_text:details timestamp=1334320415281, value=software developer …..
4 row(s) in 0.0150 seconds
Write-Through and Write-Behind
The default configuration of the map-store is write-through: records are synchronously persisted to the datastore.
If you set write-delay-seconds in hazelcast.xml to a positive value, the behaviour becomes write-behind: added entries are persisted after that many seconds.
The deleteAll and storeAll methods implemented in the MapStore are used in write-behind mode.
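So a write-behind variant of the map-store declaration above would simply be (5 seconds is an arbitrary example):

<map-store enabled="true">
    <class-name>UserMapStore</class-name>
    <!-- buffer writes and flush them to HBase every 5 seconds -->
    <write-delay-seconds>5</write-delay-seconds>
</map-store>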
If you do not want to map your objects manually, you can use Kundera, a JPA-compliant ORM for Big Data.
You can find the example project code here: