Tag Archive: cassandra


Running Hadoop MapReduce With Cassandra NoSQL

So if you are looking for a good NoSQL read of HBase vs. Cassandra you can check out http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/.  In short HBase is good for reads and Cassandra for writes.  Cassandra does a great job on reads too so please do not think I am shooting either down in any way.  I am just saying that both HBase and Cassandra have great value and useful purpose in their own right and even use cases exists to run both.  HBase recently got called up as a top level apache project coming up and out of Hadoop.

Having worked with Cassandra a bit I often see/hear of folks asking about running Map/Reduce jobs against the data stored in Cassandra instances.  Well Hadoopers & Hadooperettes the Cassandra folks in the 0.60 release provide a way to-do very nicely.   It is VERY straight forward and well thought through.  If you want to see the evolution check out the JIRA issue https://issues.apache.org/jira/browse/CASSANDRA-342

So how do you it?  Very simple, Cassandra provides an implementation of InputFormat.  Incase you are new to Hadoop the InputFormat is what the mapper is going to use to load your data into it (basically).  Their subclass connects your mapper to pull the data in from Cassandra.  What is also great here is that the Cassandra folks have also spent the time implementing the integration in the classic “Word Count” example.

See https://svn.apache.org/repos/asf/cassandra/trunk/contrib/word_count/ for this example.  Cassandra rows or row fragments (that is, pairs of key + SortedMap of columns) are input to Map tasks for processing by your job, as specified by a SlicePredicate that describes which columns to fetch from each row. Here’s how this looks in the word_count example, which selects just one configurable columnName from each row:

ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

Cassandra also provides a Pig LoadFunc for running jobs in the Pig DSL instead of writing Java code by hand. This is in https://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

Cassandra vs HBase

We are an ad network. We need to store impressions and clicks. We were evaluating various big data (or nosql -what ever you may choose to call it) systems for our new project. We had used HBase for past 8 months in an experimental product and are satisfied with it but the hype about Cassandra was so much that we decided to give it a shot. I think for some reason Cassandra team has succeeded in marketing itself very well. You will find even non technical people (such as VCs, CEOs, Product Managers) in Santa Monica recommending Cassandra to each other.

First impression of Cassandra was a good one. Their web page looks much more professional and nicer than HBase. It’s very easy to get it up and running. The website is well documented. Literally, it took me 5 minutes to set it up and get it running.

The real challenge was to understand Cassandra data model and try to implement it for our use cases. It was very clear to us how we will do it with HBase as we have a pretty good experience with it. Even though Cassandra inherits the same data model from BigTable, there are some fundamental differences between Cassandra and HBase. I have tried to tabulate the differences between the two systems below:

Cassandra HBase
Lacks concept of a Table. All the documentation will tell you that it’s not common to have multiple keyspaces. That means you have to share a key space in a cluster. Furthermore adding a keyspace requires a cluster restart! Concept of Table exists. Each table has it’s own key space. This was a big win for us. You can add and remove table as as easily as a RDBMS.
Uses string keys. Very common to use uuids as the keys. You can use TimeUUID if you want your data to be sorted by time. Uses binary keys. It’s common to combine three different items together to form a key. This means you can search by more than one key in a give table.
Even if you use TimeUUID, as Cassandra load balances client requests, hot spotting problem won’t occur. (All the client requests going to one server in a cluster is known as a hot spot problem). If your key’s first component is time or a sequential number, then hotspotting occurs. All of the new keys will be inserted to one region until it fills up (hence by causing a hotspotting problem).
Offers sorting of columns. Does not have sorting of columns.
Concept of Supercolumn allows you to design very flexible, very complex schemas. Does not have supercolumns. But you can design a super column like structure as column names and values are binary.
Does not have any convinience method to to increment a column value. In fact the vary nature of eventual consistency makes it difficult to update/write a record and read it instantly after the update. You have to make sure that R + W > N to achive strong consitency. By design consitent. Offers a nice convinience method to increment counters. Very much suitable for data aggregation.
Map Reduce support is new. You will need a hadoop cluster to run it. Data will be tranferred from Cassandra cluster to the hadoop cluster. No suitable for running large data map reduce jobs. Map Reduce support is native. HBase is built on Hadoop. Data does not get transferred.
Comparatively simpler to maintain if you don’t have to have hadoop. Comparatively complicated as you have it has many moving pieces such as Zookeeper, Hadoop and HBase itself.
Does not have a native java api as of now. No java doc. Even though written in Java, you have to use Thrift to communicate with the cluster. Has a nice native java api. Feels much more java system than Cassandra. Being a java shop, it was important for us. HBase has a thrift interface for other laguages too.
No master server, hence no single point of failure. Although there exists a concept of a master server, HBase itself does not depend on it heavily. HBase cluster can keep serving data even if the master goes down. Hadoop namenode is a single point of failure.

After comparing the data model and features in this way, HBase was the clear winner for us. In my opinion, if consistency is what you need, HBase is a clear choice. Furthermore native map reduce, concept of a table and simpler schema that can be modified without a cluster restart are a big plus you cannot ignore. HBase is a much more mature platform. When people site Twitter, Facebook using Cassandra, they forget that the same organizations are using HBase too. In fact one of the committers of HBase was recenlty hired by Facebook which clearly shows their interest in HBase.

So in short, we will root for team HBase!

Motivation

At Medallia, a key component of our system currently works with an open source relational db. Since this component mainly queries the db entries by key, we want to try to switch to a key-value storage system and take advantage of several benefits provided by such a system, including distributed replication, load balancing, and failover. One of our objectives is to re-architect this component in a way that will allow us to achieve horizontal scalability, that among other things will help us alleviate the high disk storage requirements we currently have.

Recently we took the time to look into this (and other technological improvements too, exciting times at Medallia right now!), and we reviewed several options. To make a long story short, we ended up with two finalists, Apache Cassandra and Project Voldemort.

These two projects seem to be the most mature open source options in their class, and both provide a native decentralized clustering support including partitioning, fault tolerance, and high availability. Both are based on Amazon’s Dynamo paper, but the main difference is that Voldemort follows a simple key/value model, while Cassandra uses a persistency model based onBigTable‘s column oriented model. Both provide support for read-consistency where read operations always return latest data, which was a requirement for us.

High level comparison

Project Voldemort

While not an exhaustive list, these are the most relevant pros and cons we identified when reviewing both stores:

  • Pros
    • Simpler API
    • Persistency based on Berkley DB, a mature and well-known key/value db
    • Uses Vector Clocks instead of simple timestamps. It doesn’t need the nodes (or clients) clocks to be synchronized
  • Cons
    • No built-in support for “multiple data center”-aware routing (meaning there must be 1 copy of each key in at least one data center)

Apache Cassandra

  • Pros
    • Broader range of systems in production (Facebook, Twitter, Digg, Rackspace)
    • Richer API which supports values with a dynamic column structure. The columns can evolve independently, meaning that you can update one column without reading the whole structure.
    • Optimized for writes (by design)
    • Configurable consistency level (specified on each request)
  • Cons
    • File format is still in development, changes to the internal structure are likely to happen. Due to the flexibility it provides, the file format is more complex and harder to reason with, especially in terms of performance
    • Requires Clock Synch (NTP) (for nodes and clients)
    • Reads are more disk-intensive than competitors
    • Doesn’t support client conflict resolution, so the latest update always wins

Performance Tests

To our surprise this was the only link we’ve found that compares the performance for both projects – thus we decided to write this post to share our research. We used the vpork test framework, which we modified to suit our needs by upgrading the client code to the latest versions, adding a warm-up phase, and adding rewrite capabilities. These are the results of our tests:

Setup:

  • Versions
    • Voldemort v0.80.1
    • Cassandra 0.6.0-beta3
  • Boxes: 3 similar nodes with the following spec:
    • 4GB maximum heap size
    • Replication parameters: N=3 (replicas for each entry), R=2 (nodes to wait for on each read), W=2 (nodes to block for on each write)
    • 8 processors on each server (Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
    • 1TB disk space (Seagate ST31000340NS, 7200 RPM, 32MB cache)
  • Persistence parameters
    • Voldemort (default values)
      • key-serializer: string
      • value-serializer: identity (byte array)
      • persistence engine=bdb (Berkley DB)
      • bdb.cache.size=1536MB
    • Cassandra
      • ColumnFamily definition: CompareWith=”BytesType” RowsCached=”10000″
      • ReplicationFactor=3
      • Partitioner=org.apache.cassandra.dht.RandomPartitioner
      • ConcurrentReads=16
      • ConcurrenWrites=32
  • Tests
    • Client Threads: 40
    • Initial load: 5 million records – records present before starting each test
    • WarmUp: 20K records – initial writes before measuring time for each test
    • Number of operations per test: 500K

We ran tests for 4 different write-rewrite-read configurations. A write is equivalent to a put operation with a new record (non-existing key). A rewrite is a put operation with an existing key. A read is a get operation on an existing key. These are the configurations we tested:

  • 50% Write 50% Read
  • 10% Write 40% Rewrite 50% Read
  • 50% Rewrite 50% Read
  • 90% Rewrite 10% Read

We ran all the tests for two different value sizes, 15 and 1.5 KB. Even though we evaluated different options, for our current needs, the last one with a 15 KB data entry was the most representative scenario.

The first pair of charts shows the latency, or average time it takes a read or write operation to complete in each case. Lower values are better. As expected, Cassandra write (and rewrite) times were consistently faster than Voldemort, while read times varied a bit depending on the scenario but were more or less the same in general.

test-avg-lat-1.5.png
test-avg-lat-15.png
The second pair of charts shows the maximum time in the best 99% of cases; again lower values are better:

test-99-lat-1.5.png

test-99-lat-15.png

On the front-end, we have a write-back cache which means that write operations don’t affect the user experience. On the other hand, read operations are directly related to page loads. That’s why we were concerned about the peak for Cassandra read in the last scenario for 15KB. We ran some further tests to measure the 99.9% and 99.99% percentiles and the difference was even greater: 5050 ms for Cassandra and 748 ms for Voldemort in the first case, and 9176 ms against 1129 ms in the second case. This huge difference was a key decision factor for us.

Finally, these two charts show the general throughput in terms of operations (read or write) per second. In this case higher values are better:

test-ops-1.5.png
test-ops-15.png

Notes:

  • Cassandra commit log and data folder are supposed to be placed at different disks to improve performance, we tested with both on the same disk.

Issues found while testing:

  • Voldemort client.put(K key, V value) (not the one that takes a Version object) throws ObsoleteVersionException if called with the same key from different threads. The javadoc states “Associated (sic) the given value to the key, clobbering any existing values stored for the key. “, so this was not expected.

And the winner is …

I think there is no clear winner, in general terms. The best option depends on many factors that each company has to evaluate. My preference changed a few times during the review and tests.

Having said that, we had to choose one, and we decided to go with Project Voldemort. The main reasons were the simplicity, better versioning control, persistency layer maturity, and latency predictability.

We are currently developing the new solution, and it will take some time before we can put it in production, but we wanted to share our preliminary results with everyone who is considering one of these two options, so they’ll have one more tool at the time of the decision.

We’ll keep you posted on how it goes.

Diego Erdody
Lead Software Engineer
Other useful articles comparing different key-value stores:

Digg từ bỏ MySQL, chuyển sang dùng NoSQL

Trang bầu chọn nội dung lớn nhất trên Internet Digg.com đang trong quá trình thay đổi toàn bộ hạ tầng phần mềm nhằm tăng tốc ứng dụng cũng như mở rộng mạng lưới. Một trong những nội dung quan trọng nhất đó là nỗ lực thay thế gần như toàn bộ cơ sở dữ liệu nguồn mở nổi tiếng nhất mà họ đã sử dụng từ ngày thành lập cho đến nay, MySQL.

Cassandra logo

Thay cho MySQL – một cơ sở dữ liệu quan hệ, vốn là loại cơ sở dữ liệu phổ biến nhất – là Cassandra, một loại cơ sở dữ liệu không phải là quan hệ (được gọi chung là các cơ sở dữ liệu NoSQL). Cassandra vốn là 1 sản phẩm nguồn mở của Facebook, nay nằm dưới sự điều hành của Apache Software Foundation.

Theo các lập trình viên của dự án Cassandra, cơ sở dữ liệu này hiện đang được dùng bởi Rackspace, Facebook, Twitter. Và nay danh sách này có thêm Digg.

John Quinn, phó chủ tịch phụ trách kỹ thuật tại Digg cho biết họ đã phát triển một công cụ giúp cho việc chuyển đổi từ MySQL sang Cassandra được dễ dàng và nhanh chóng hơn (với sự trợ giúp của Hadoop — cũng là một dự án của Apache Software Foundation về ứng dụng phân tán). Công cụ này sẽ được mở nguồn trong thời gian tới nhằm hỗ trợ những cá nhân hay tổ chức có ý định chuyển đổi tương tự.

Giải thích về nguyên nhân thay thế MySQL, Quinn cho biết Cassandra là một mô hình cơ sở dữ liệu phân tán hoàn toàn, có khả năng chịu lỗi cực tốt. Một tính chất nữa là nó rất linh hoạt, tốc độ đọc/ghi tăng tuyến tính khi bổ sung thêm hạ tầng mới. Những đặc điểm này của một hệ cơ sở dữ liệu không quan hệ là thích hợp hơn với các ứng dụng phân tán lớn như Digg hơn là các cơ sở dữ liệu quan hệ truyền thống.

Mặc dù hầu hết hạ tầng phần mềm của Digg sẽ dùng Cassandra, MySQL vẫn được dùng trong một số trường hợp đặc biệt như một vài ứng dụng cụ thể hay khi cần triển khai mô hình ứng dụng nhanh chóng. “MySQL có mức độ mềm dẻo mà Cassandra không thể có. Nó rất thích hợp với các dự án nhỏ”, Quinn cho biết.

Theo thongtincongnghe

BigTable Model with Cassandra and HBase

Recently in a number of “scalability discussion meeting”, I’ve seen the following pattern coming up repeatedly …

  1. To make your app scalable, you try to make your app layer “stateless”.
  2. OK, so you move the “state” out from your application layer out to a shared DB, or shared data layer.
  3. Now, how do we make the data tier scalable, by definition, we cannot make the data tier stateless.
  4. OK, now lets think about how to “partition” your data and spread them across multiple machines in such a way that workload is balanced.
  5. Now there are more boxes and what if some of them crashes.
  6. OK, we should replicate the data across machines.
  7. Now, how do we keep these data in sync …
  8. And then Cloud computing gets into the picture (as it always does). Now we are not just having a pool of machines but also the pool size can grow and shrink according to workload fluctuation (you don’t want to pay for something idle, right ?).
  9. Now we need to figure out as we add more machines into the pool or remove machine from the pool, how we should “redistribute” the data.

This is an area where NOSQL shines. In the last 18 months, NOSQL has become one of the hottest topic in the software industry. It has been introduced as a solution to large scale data storage problem at the range of Terabytes or Petabytes. Dozens of NOSQL products has come to the market, but two leaders HBase and Cassandra seems to stand out from the rest in terms of their adoption.

Given an increasing demand of explaining these 2 products recently, I decide to write a post on this.

Not to repeat the basic theory of NOSQL here, for a foundation ofdistributed system theory underlying the NOSQL design, please refer to my earlier blog

Both Hbase and Cassandra are based on Google BigTable model, here lets introduce some key characteristic underlying Bigtable first.

Fundamentally Distributed
BigTable is built from the ground up on a “highly distributed”, “share nothing” architecture. Data is supposed to store in large number of unreliable, commodity server boxes by “partitioning” and “replication”. Data partitioning means the data are partitioned by its key and stored in different servers. Replication means the same data element is replicated multiple times at different servers.


Column Oriented
Unlike traditional RDBMS implementation where each “row” is stored contiguous on disk, BigTable, on the other hand, store each column contiguously on disk. The underlying assumption is that in most cases not all columns are needed for data access, column oriented layout allows more records sitting in a disk block and hence can reduce the disk I/O.

Column oriented layout is also very effective to store very sparse data (many cells have NULL value) as well as multi-value cell. The following diagram illustrate the difference between a Row-oriented layout and a Column-oriented layout


Variable number of Columns
In RDBMS, each row must have a fixed set of columns defined by the table schema, and therefore it is not easy to support columns with multi-value attributes. The BigTable model introduces the “Column Family” concept such that a row has a fixed number of “column family” but within the “column family”, a row can have a variable number of columns that can be different in each row.


In the Bigtable model, the basic data storage unit is a cell, (addressed by a particular row and column). Bigtable allow multiple timestamp version of data within a cell. In other words, user can address a data element by the rowid, column name and the timestamp. At the configuration level, Bigtable allows the user to specify how many versions can be stored within each cell either by count (how many) or by freshness (how old).

At the physical level, BigTable store each column family contiguously on disk (imagine one file per column family), and physically sort the order of data by rowid, column name and timestamp. After that, the sorted data will be compressed so that a disk block size can store more data. On the other hand, since data within a column family usually has a similar pattern, data compression can be very effective.

Sequential write
BigTable model is highly optimized for write operation (insert/update/delete) with sequential write (no disk seek is needed). Basically, write happens by first appending a transaction entry to a log file (hence the disk write I/O is sequential with no disk seek), followed by writing the data into an in-memory Memtable . In case of the machine crashes and all in-memory state is lost, the recovery step will bring the Memtable up to date by replaying the updates in the log file.

All the latest update therefore will be stored at the Memtable, which will grow until reaching a size threshold, then it will flushed the Memtable to the disk as an SSTable (sorted by the String key). Over a period of time there will be multiple SSTables on the disk that store the data.

Merged read
Whenever a read request is received, the system will first lookup the Memtable by its row key to see if it contains the data. If not, it will look at the on-disk SSTable to see if the row-key is there. We call this the “merged read” as the system need to look at multiple places for the data. To speed up the detection, SSTable has a companion Bloom filter such that it can rapidly detect the absence of the row-key. In other words, only when the bloom filter returns positive will the system be doing a detail lookup within the SSTable.

Periodic Data Compaction
As you can imagine, it can be quite inefficient for the read operation when there are too many SSTables scattering around. Therefore, the system periodically merge the SSTable. Notice that since each of the SSTable is individually sorted by key, a simple “merge sort” is sufficient to merge multiple SSTable into one. The merge mechanism is based on a logarithm property where two SSTable of the same size will be merge into a single SSTable will doubling the size. Therefore the number of SSTable is proportion to O(logN) where N is the number of rows.

After looking at the common part, lets look at their difference of Hbase and Cassandra.

HBase
Based on the BigTable, HBase uses the Hadoop Filesystem (HDFS) as its data storage engine. The advantage of this approach is then HBase doesn’t need to worry about data replication, data consistency and resiliency because HDFS has handled it already. Of course, the downside is that it is also constrained by the characteristics of HDFS, which is not optimized for random read access. In addition, there will be an extra network latency between the DB server to the File server (which is the data node of Hadoop).


In the HBase architecture, data is stored in a farm of Region Servers. The “key-to-server” mapping is needed to locate the corresponding server and this mapping is stored as a “Table” similar to other user data table.

Before a client do any DB operation, it needs to first locate the corresponding region server.

  1. The client contacts a predefined Master server who replies the endpoint of a region server that holds a “Root Region” table.
  2. The client contacts the region server who replies the endpoint of a second region server who holds a “Meta Region” table, which contains a mapping from “user table” to “region server”.
  3. The client contacts this second region server, passing along the user table name. This second region server will lookup its meta region and reply an endpoint of a third region server who holds a “User Region”, which contains a mapping from “key range” to “region server”
  4. The client contacts this third region server, passing along the row key that it wants to lookup. This third region server will lookup its user region and reply the endpoint of a fourth region server who holds the data that the client is looking for.
  5. Client will cache the result along this process so subsequent request doesn’t need to go through this multi-step process again to resolve the corresponding endpoint.

In Hbase, the in-memory data storage (what we refer to as “Memtable” in above paragraph) is implemented in Memcache. The on-disk data storage (what we refer to as “SSTable” in above paragraph) is implemented as a HDFS file residing in Hadoop data node server. The Log file is also stored as an HDFS file. (I feel storing a transaction log file remotely will hurt performance)

Also in the HBase architecture, there is a special machine playing the “role of master” who monitors and coordinates the activities of all region servers (the heavy-duty worker node). To the best of my knowledge, the master node is the single point of failure at this moment.

For a more detail architecture description, Lars George has a very good explanation in the log file implementation as well as the data storage architecture of Hbase.

Cassandra
Also based on the BigTable model, Cassandra use the DHT (distributed hash table) model to partition its data, based on the paper described in the Amazon Dynamo model.

Consistent Hashing via O(1) DHT
Each machine (node) is associated with a particular id that is distributed in a keyspace (e.g. 128 bit). All the data element is also associated with a key (in the same key space). The server owns all the data whose key lies between its id and the preceding server’s id.

Data is also replicated across multiple servers. Cassandra offers multiple replication schema including storing the replicas in neighbor servers (whose id succeed the server owning the data), or a rack-aware strategy by storing the replicas in a physical location. The simple partition strategy is as follows …


Tunable Consistency Level
Unlike Hbase, Cassandra allows you to choose the consistency level that is suitable to your application, so you can gain more scalability if willing to tradeoff some data consistency.

For example, it allows you to choose how many ACK to receive from different replicas before considering a WRITE to be successful. Similarly, you can choose how many replica’s response to be received in the case of READ before return the result to the client.

By choosing the appropriate number for W and R response, you can choose the level of consistency you like. For example, to achieve Strict Consistency, we just need to pick W, R such that W + R > N. This including the possibility of (W = one and R = all), (R = one and W = all), (W = quorum and R = quorum). Of course, if you don’t need strict consistency, you can even choose a smaller value for W and R and gain a bigger availability. Regardless of what consistency level you choose, the data will be eventual consistent by the “hinted handoff”, “read repair” and “anti-entropy sync” mechanism described below.

Hinted Handoff
The client performs a write by send the request to any Cassandra node which will act as the proxy to the client. This proxy node will located N corresponding nodes that holds the data replicas and forward the write request to all of them. In case any node is failed, it will pick a random node as a handoff node and write the request with a hint telling it to forward the write request back to the failed node after it recovers. The handoff node will then periodically check for the recovery of the failed node and forward the write to it. Therefore, the original node will eventually receive all the write request.

Conflict Resolution
Since write can reach different replica, the corresponding timestamp of the data is used to resolve conflict, in other words, the latest timestamp wins and push the earlier timestamps into an earlier version (they are not lost)

Read Repair
When the client performs a “read”, the proxy node will issue N reads but only wait for R copies of responses and return the one with the latest version. In case some nodes respond with an older version, the proxy node will send the latest version to them asynchronously, hence these left-behind node will still eventually catch up with the latest version.

Anti-Entropy data sync
To ensure the data is still in sync even there is no READ and WRITE occurs to the data, replica nodes periodically gossip with each other to figure out if anyone out of sync. For each key range of data, each member in the replica group compute a Merkel tree (a hash encoding tree where the difference can be located quickly) and send it to other neighbors. By comparing the received Merkel tree with its own tree, each member can quickly determine which data portion is out of sync. If so, it will send the diff to the left-behind members.

Anti-entropy is the “catch-all” way to guarantee eventual consistency, but is also pretty expensive and therefore is not done frequently. By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up-to-date.

BigTable trade offs
To retain the scalability features of BigTable, some of the basic features of what RDBMS has provided is missing in the BigTable model. Here we highlight the rough edges of Bigtable.

1) Primitive transaction support
Transaction protection is only guaranteed within a single row. In other words, you cannot start a atomic transaction to modify multiple rows.

2) Primitive isolation support
While you are reading a row, other people may have modified the same row and update it before you. Your view is not current anymore but your later update can easily wipe off other people’s change.

There are many techniques how concurrent update can be isolated, including pessimistic approach like locking or optimistic approach by using vector clock to be the version stamp. But to the best of my understanding, there is no robust test-and-set operation in the BigTable model (this is some getLock mechanism in Hbase which I haven’t looked into), my impression is that there is no easy way to check there is no concurrent update happen in between.

Because of this limitation, I think BigTable model is more suitable for those applications where concurrent update to the same row is very rare, or some inconsistency is tolerable at the application level. Fortunately, there are still a lot of applications falling into this bucket.

3) No indexes
Notice that data within BigTable are all physically sorted; by rowid, column name and timestamp. There is no index from the column value to its containing rowid.

This model is quite different from RDBMS where you typically define a table and worry about defining the index later. There is no such “index” concept in BigTable and you need to carefully plan out the physical sorting order of your data layout.

Lacking index turns out to be quite inconvenient and many people using Bigtable ends up building their own index at the application level. This usually results in having a highly denormalized data model with lots of column family who store links to other tables. Any update to the base data need to carefully update these other column family as well. From a performance angle, this is actually better than maintaining index in RDBMS because Bigtable is optimized from writes. However, since it is now the application logic to maintain the index, this can be a source of application bugs.

4) No referential integrity enforcement
As mentioned above, since you are building artificial index at the application level, you need to maintain the integrity of your index as well. This includes update your index when the base data is inserted, modified or deleted. This kind of handling logic is traditionally residing at the RDBMS level, but since BigTable has no such referential integrity concept, this responsibility is now landed on your application logic.

5) Lack of surrounding tools
As NOSQL or BigTable is very new, the tools surrounding it is definitely not comparable to the RDBMS world at this moment, such tools includes report generation, BI, data warehouse … etc.

I observe the general trend that most NOSQL products are moving towards the direction to provide an ODBC / JDBC interface to integrate with existing tool markets easier. But at this moment, to the best of my understanding, such interface is not wide spread yet.

Design Patterns for Bigtable model
Due to the very different model of Bigtable, the data model design methodology is also quite different from traditional RDBMS schema design. Here is a sequence of steps that are pretty common …

1) Identify all your query scenarios
Since there is no index concept, you have to plan out carefully how your data is physically sorted. Therefore it is important to find all your query use cases first.

2) Define your “entity table” and its corresponding column families
For an entity table, it is pretty common to have one column family storing all the entity attributes, and multiple column families to store the links to other entities. (e.g. A “UserTable” may contain a column family “baseInfo” to store all attributes of the user, a column family “friend” to store the links to another user, a column family “company” to store links to another CompanyTable)

3) Define your “index table”
The “index table” is what your application build to support reverse lookup. The “key” is typically base on the search criteria you have identified in your query scenario. It is not uncommon that each query may have its own specific index table.

4) Make sure your application logic updates the index correctly
Since the index table has to be maintained by application logic, you need to check to make sure it is done correctly. In many cases, this can be quite a source of bugs.

It is important to realize that NOSQL is not advocating a replacement of RDBMS which has been proven in many lines of application. The NOSQL should be considered a complementary technologies for some niche area where RDBMS is not covering well.

Cassandra: Performance Tuning

Recently we’ve learned how to tune Cassandra garbage collection, but now we have a full presentation from Brandon Williams (@faltering) about tunning Cassandra for performance[1]:

Also make sure you check these articles about Cassandra write operation performance andCassandra reads performance.


  1. The video was recorded at Cassandra Summit NoSQL conference.  ()

References:

An ☞ interesting explanation of how Cassandra write ops are happening:

  • client submits its write request to a single, random Cassandra node
  • the node behavior is similar to a proxy writing data to the cluster
  • writes are replicated to N nodes according to the replication placement strategy (the details of RackAwareStrategy are quite interesting)
  • each of the N nodes performs 2 actions when receiving a write (in the form ofRowMutation):
    • append the mutation to the commit log for transactional purposes
    • update an in-memory Memtable structure with the change

There are also a couple of asynchronous operations:

  • Memtable is written to disk in a structure called SSTable
  • SSTables corresponding to a column family are merged into a raw ColumnFamily datafile.

You should definitely check the ☞ original post as there are more interesting details.

References:

Cassandra Reads Performance Explained

After explaining Cassandra writes performance, Mike Perham ☞ continues his series now explaining: “reads and […] why they are slow”.

So what happens with a Cassandra read?

  • a client makes a read request to a random node
  • the node acts as a proxy determining the nodes having copies of data
  • the node request the corresponding data from each node
  • the client can select the strength of the read consistency:
    • single read => the request returns once it gets the first response, but data can be stale
    • quorum read => the request returns only after the majority responded with the same value

      Mark mentions a couple of corner cases related to this behavior that is more complicated.

  • the node also performs read repair of any inconsistent response
  • each node reading data uses either Memtable (in-memory) or SSTables (disk)

    Mike and Jonathan provide a very detailed explanation of the read performance:

    Mike: To scan the SSTable, Cassandra uses a row-level column index and bloom filter to find the necessary blocks on disk, deserializes them and determines the actual data to return. There’s a lot of disk IO here which ultimately makes the read latency higher than a similar DBMS.

    Jonathan: The reason uncached reads are slower in Cassandra is not because theSSTable is inherently io-intensive (it’s actually better than b-tree based storage on a 1:1 basis) but because in the average case you’ll have to merge row fragments from 2-4 SSTables to complete the request, since SSTables are not update-in-place.

    It is also important to note that Cassandra employs row caching that addresses reads latency.

Mike’s post also covers Cassandra range scans and explains the role of Cassandra partitioning strategies☞ Great read!

References:

NoSQL Libraries

This page represents an ongoing effort to provide a quick reference to the various NoSQLlibraries.

Submit your library using this form!

Cassandra #

Python #

Cassandra Python Libraries

Ruby #

Cassandra Ruby Libraries

Other #

CouchDB #

Erlang #

CouchDB Erlang Libraries

Haskell #

CouchDB Haskell Libraries

Java #

CouchDB Java Libraries

Perl #

CouchDB Perl Libraries

PHP #

CouchDB PHP Libraries

Python #

CouchDB Python Libraries

Ruby #

CouchDB Ruby Libraries

Scala #

CouchDB Scala Libraries

.NET #

CouchDB .NET Libraries

Other #

Hadoop #

Ruby #

Hadoop Ruby Libraries

Other #

HBase #

Clojure #

HBase Clojure Libraries

Java #

HBase Java Libraries

Ruby #

HBase Ruby Libraries

MongoDB #

#

MongoDB C Libraries

Haskell #

MongoDB Haskell Libraries

Java #

MongoDB Java Libraries

Perl #

MongoDB Perl Libraries

PHP #

MongoDB PHP Libraries

Python #

MongoDB Python Libraries

Ruby #

MongoDB Ruby Libraries

Scala #

MongoDB Scala Libraries

Other #

GridFS #

Neo4j #

Java #

Neo4j Java Libraries

Ruby #

Neo4j Ruby Libraries

Other #

Redis #

Clojure #

Redis Clojure Libraries

C# #

Redis C# Libraries

Erlang #

Redis Erlang Libraries

Haskell #

Redis Haskell Libraries

Java #

Redis Java Libraries

Perl #

Redis Perl Libraries

PHP #

Redis PHP Libraries

Python #

Redis Python Libraries

Ruby #

Redis Ruby Libraries

Other #

Riak #

Riak Official Libraries #

Riak Official Libraries

Riak Official Libraries #

Riak Official Libraries

Ruby #

Riak Ruby Libraries

Scala #

Riak Scala Libraries

Other #

SimpleDB #

Python #

SimpleDB Python Libraries

Terrastore #

Scala #

Tokyo Cabinet #

Python #

Tokyo Cabinet Python Libraries

Ruby #

Tokyo Cabinet Ruby Libraries

Uncategorized #

Friendly ☞: Schema-less MySQL with Ruby

Gremlin ☞: Graph-based programming language

Fom ☞: Fluid Object Mapping (FluidDB)

Note: each section on this page is bookmarkable by using the # link.

You can submit your library using this form

Run Cassandra As A Windows Service

ne of the main issues that comes up over and over again for Cassandra is:

How do I run Cassandra as a Windows Service?

In this post I am going to answer that question and in the process demonstrate how to do it in less than 10 minutes.

Background

Cassandra is mainly developed by Linux developers so very little attention has been paid to the Windows developer or administrator as far as Cassandra goes.  So as Windows developers we have to hop through a couple more hoops than just clicking on an install.exe file and and letting it do all the work.  However lucky for us, those hoops are easy and quickly hopped through.

Step 1

If you haven’t done so already please read my jump start for Windows users on install Cassandra, this guide will get you ready for the next steps.

Step 2

The second step is also an easy one, you need to download a package called RunAsService, which provides the ability to run any program as a Windows Service.

After you have downloaded the file extract the contents to a directory of your choosing.  (I extracted it to c:\RunAsService)

Note: RunAsService was originally developed here, however I recompiled it to run on .NET 2.0.

Step 3

To install RunAsService open up a command prompt with Administrative privileges and run this command.

cd c:\RunAsService
install networkservice

This registers RunAsService with your Windows Service.  Make sure to keep your command prompt open because you will need it for the 5th step.

Step 4

To configure RunAsService for Cassandra open up the RunAsService.exe.config file in your favorite text editor and replace <service.settings> section with the following so that it looks like this:

<!– Services configuration –>
<service.settings>
<!– Run Cassandra as a service –>
<!– My Cassandra install path is C:\apache-cassandra\ –>
<service>
<name>Cassandra Database</name>
<executable>C:\apache-cassandra\bin\cassandra.bat</executable>
<parameters></parameters>
</service>
</service.settings>

<!– Services configuration –><service.settings>    <!– Run Cassandra as a service –>    <!– My Cassandra install path is C:\apache-cassandra\ –>    <service>        <name>Cassandra Database</name>        <executable>C:\apache-cassandra\bin\cassandra.bat</executable>        <parameters></parameters>    </service></service.settings>

After you have finished, save the config file and exit your text editor.

Note: My Cassandra install is in c:\apache-cassandra\ you will have to correct the config above for where you installed it if different than mine.

Step 5

The last and final step of this process is to start the RunAsService service.  You can either do it through the Services control panel or just type the following in to your command prompt.

net start runasservice

You should see a response in the command line saying that the service has been successfully started.  To verify that Cassandra has been started you can use the cassandra-cli.bat file:

cd c:\apache-cassandra\bin\ cassandra-cli.bat connect localhost/9160

It should report that it is connected to the server if the service is running.  And with that we are done, and I told you it would only take about 10 minutes.

Powered by WordPress | Theme: by 85ideas. Editor by Khoanguyen