Benchmarks for DATE operations in MySQL

This article compares the relative speed of extracting the date part of a value in MySQL with LEFT() and with the DATE() function.

LEFT() is faster than DATE(). To prove this, I inserted two million un-indexed sequential values into a table and selected the minimum and maximum values. Both queries are table scans, so each one reads through all the records. The table below lists the time in seconds for MAX() on my computer. I tested with three data types: DATE, TIMESTAMP and DATETIME.

I don’t know why it’s faster to use LEFT() than DATE(). I would assume the reverse to be true, but clearly it’s not, at least on the systems I’ve tested.

The time for MIN() is one or two milliseconds faster than MAX(), probably because the values are sequential and only one assignment is performed, whereas the MAX() query must perform two million assignments.
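
The assignment-count argument can be checked with a small simulation. This is only a sketch in Python, not MySQL's actual aggregate code: the `count_assignments` helper is a made-up name, and it assumes a naive single-pass scan that overwrites the running result whenever a better value turns up.

```python
def count_assignments(values, better):
    """Scan values once, counting how often the running best is replaced."""
    it = iter(values)
    best = next(it)
    assignments = 1  # the initial assignment of the first value
    for v in it:
        if better(v, best):
            best = v
            assignments += 1
    return best, assignments


# Two million sequential (ascending) values, as in the benchmark table.
values = range(1, 2_000_001)
print(count_assignments(values, lambda a, b: a < b))  # running minimum
print(count_assignments(values, lambda a, b: a > b))  # running maximum
```

On ascending input the minimum is assigned once, while the maximum is replaced on every row, which matches the explanation above.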

Most high-scale web applications use MySQL + memcached. Many of them also use NoSQL stores such as TokyoCabinet/Tyrant. In some cases people have dropped MySQL and shifted to NoSQL. One of the biggest reasons for this movement is that NoSQL is said to perform better than MySQL for simple access patterns such as primary key lookups. Most queries from web applications are simple, so this seems like a reasonable decision.
Like many other high-scale web sites, we at DeNA(*) had similar issues for years. But we reached a different conclusion. We are using “only MySQL”. We still use memcached for front-end caching (i.e. preprocessed HTML, count/summary info), but we do not use memcached for caching rows. We do not use NoSQL, either. Why? Because we could get much better performance from MySQL than from NoSQL products. In our benchmarks, we could get 750,000+ qps on a commodity MySQL/InnoDB 5.1 server from remote web clients. We have also seen excellent performance in production environments.
Maybe you can’t believe the numbers, but this is a real story. In this long blog post, I’d like to share our experiences.
(*) For those who do not know: I left Oracle in August 2010. Now I work at DeNA, one of the largest social game platform providers in Japan.

C# List Insert Mistake

You want to see how List.Insert performs. The Insert method on List has serious problems that may significantly degrade performance. Here we benchmark the impact of Insert and look at an example, using the C# programming language.

Inserting into List

Here I found that the Insert instance method on the List type does not have good performance in many cases. The List type is able to allocate new memory at the end quickly in Add, but for Insert, it has to adjust the following elements.

PHP vs .NET performance #1

Background
I’ve spent a great many years developing quite large scale websites using Microsoft ASP and ASP.NET based content management systems. In the last couple of years I’ve also built some sites with PHP based CMSs.

Like many people, I’ve been curious about the relative performance of PHP vs .NET but never seemed able to find any conclusive benchmarks, since much of the information is distorted by the religious fervour of those programmers strongly in the PHP or .NET camps.

For what it’s worth, I very much like the idea of open source software but I would much rather know (and speak) the truth, regardless of whether that conflicts with my personal interests.

The benchmark
My benchmark is very simple: measure how long it takes a page running in both C#.NET and PHP to count up to a reasonably large number (in this case 10,000,000) and output the time taken.

Both tests were run on the same machine, an Ubuntu / Vista dual-booted Intel quad core with 8GB RAM, though the hardware spec is irrelevant since it’s identical for both tests.

The code

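<h1>C# Performance Test</h1>
<%
// Count to ten million and time the loop.
int x = 0;
DateTime start = DateTime.Now;
for (int i = 0; i < 10000000; i++)
{
    x = x + 1;
}
DateTime end = DateTime.Now;
TimeSpan total_time = end - start;
// Milliseconds is only the 0-999 ms component of the interval;
// TotalMilliseconds would report the whole elapsed time.
Context.Response.Write("Total Time=" + total_time.Milliseconds);
%>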

<h1>PHP Performance Test</h1>
<?php
// Count to ten million and time the loop.
$x = 0;
$start = microtime(true);
for ($i = 0; $i < 10000000; $i++) {
    $x = $x + 1;
}
$end = microtime(true);
$total_time = $end - $start; // microtime(true) returns seconds
print "Total time=" . $total_time;
?>

Results

I must say, I’m shocked and saddened by the result. Yes, real world server performance would be very different, but nonetheless…

PHP: 2450 milliseconds

.NET: 45 milliseconds

Yes, you read that correctly: .NET is roughly 55 times faster than PHP in my simple benchmark!

(Please read the important update below)

Conclusions

.NET is compiled and PHP is interpreted so it was “obvious” that .NET would be faster, right? The reality is not quite so simple (or so I thought). .NET is actually compiled into “byte-code” to be executed by a run-time engine and PHP is parsed into “op-code” to also be executed by a run-time engine. There are differences in these two approaches, but they aren’t so very different. .NET is *not* compiled ahead of time into native object code, although its run-time does JIT-compile the byte-code into machine code as it executes. Also in PHP much of the actual work will be done inside functions written in compiled C code called by the run-time engine.

I was still expecting .NET to be faster but not to such a large extent.

My benchmark is extremely simplistic and doesn’t much correspond to a real world website where many other factors such as database speed, memory usage, caching, pipe-lining efficiency etc will all come into play. However, as an approximate measure of raw computing speed I think it’s a pretty reasonable test. As a measure of overall server performance, it really won’t be that simple.

The vast majority of even quite high profile websites receive relatively modest traffic which can be handled comfortably by a comparatively modest server. For most websites, development time & costs are more relevant than raw speed.

Scaling out to handle high traffic loads is generally cheaper with PHP (i.e. open source) server stacks due to the lack of licence fees. Getting to the point where you need to scale without having had to pay huge licence fees is a big plus too.

Nonetheless, whichever way I slice it, .NET is 55 times faster than PHP. That is something worth thinking about!

Faster Still

It is worth noting that similar tests I’ve run indicate that C++ is 6 times faster again than .NET.

Not very many people advocate developing websites in C++. Speed really isn’t everything!

I’m going to run some more benchmarks and I’ll post the results.

Important update

I’ve done a bit more testing which has turned up some interesting (and quite heart warming) results.

I’ve tweaked the benchmark program so that rather than simply incrementing an integer, it performs a string concatenation.

e.g. changed x = x + 1 to x = x + "x" (and $x = $x."x")

To non-programmers the above statements might look identical but they are very, very different in terms of the amount of work the cpu has to do. The first statement uses the same few bytes of memory over and over again, the second rapidly consumes megabytes of memory. I also had to reduce the number of iterations to 100,000 to prevent script time outs in both languages.

PHP handled the new benchmark in only 648 milliseconds.
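.NET claimed to be taking anywhere between 184 milliseconds and 999 milliseconds, yet by counting slowly in my head, I could see that it was consistently taking longer than 5000 milliseconds. Turns out that .NET has known issues with DateTime.Now accuracy when it’s under load! (A reported range that never exceeds 999 is also what you would see from TimeSpan.Milliseconds, which returns only the sub-second component of the interval rather than the total.)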

In both cases, as the number of iterations increased, the benchmark time grew much faster than linearly (roughly quadratically), which is what you would expect given how most languages handle string concatenation. (Some people might argue I should be using a .NET StringBuilder, but that would be missing the point of this benchmark.)
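
To see why repeated concatenation gets expensive so quickly, here is a rough cost model in Python. It assumes every concatenation allocates a brand-new string and copies the old contents into it; real runtimes can be smarter than that, so treat `chars_copied` (a name made up for this sketch) as an illustration rather than a measurement of PHP or .NET.

```python
def chars_copied(iterations, piece="x"):
    """Characters copied if every concatenation builds a brand-new string."""
    copied = 0
    length = 0
    for _ in range(iterations):
        length += len(piece)
        copied += length  # the new string holds the old contents plus the piece
    return copied


for n in (1_000, 10_000, 100_000):
    print(n, chars_copied(n))
```

Under this model, ten times as many iterations costs about a hundred times as much copying, so the work grows with the square of the iteration count.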

The new benchmark still falls short of representing a real world web application but this second example is closer as web apps are all about concatenating many strings of html together.

There is nothing else for it, I’m going to have to set up a proper benchmark and test a few real world CMS in a realistic scenario. (I’ve been meaning to do this for ages). But for now, I have to sleep!

I wanted to compare several DBs, NoSQL stores and caching solutions for speed and connections. I tested Redis, Memcache, Tokyo Tyrant and MySQL (MyISAM and InnoDB).

My test had the following criteria

  • 2 client boxes
  • All clients connecting to the server using Python
  • Used Python’s threads to create concurrency
  • Each thread made 10,000 open-close connections to the server
  • The server was
    • Intel(R) Pentium(R) D CPU 3.00GHz
    • Fedora 10 32bit
    • 2.6.27.38-170.2.113.fc10.i686 #1 SMP
    • 1GB RAM
  • Used an MD5 hash as the key and a value that was saved
  • Created an index on the key column of the table
  • Each server had SET and GET requests as a different test at same concurrency
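
The connection test described by these criteria can be sketched roughly as follows. This is not the attached source code: it is a minimal Python sketch that assumes a plain TCP server, uses a throwaway local server as a stand-in for memcached/Redis/MySQL, and the names `run_threads` and `demo` are made up for illustration.

```python
import socket
import socketserver
import threading
import time


class _Discard(socketserver.BaseRequestHandler):
    """Stand-in server side: read a few bytes, then let the connection close."""

    def handle(self):
        self.request.recv(64)


def run_threads(host, port, n_threads, conns_per_thread):
    """Each thread opens and closes conns_per_thread connections in turn.

    Returns (total_connections, elapsed_seconds).
    """
    def worker():
        for _ in range(conns_per_thread):
            with socket.create_connection((host, port)) as s:
                s.sendall(b"ping")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return n_threads * conns_per_thread, time.perf_counter() - start


def demo(n_threads=4, conns_per_thread=50):
    """Run the test against a local throwaway server; returns (total, conns/sec)."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), _Discard)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        total, elapsed = run_threads(host, port, n_threads, conns_per_thread)
        return total, total / elapsed
    finally:
        server.shutdown()
        server.server_close()
```

In the real test each thread made 10,000 connections and issued a SET or GET before closing; the sketch only measures the make-and-break part.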

Results please !

Work sheet

throughput set

throughput get

I wanted to simulate a situation where I had 2 servers (clients) serving my code, which connected to the 1 server (memcached, redis, or whatever). Another thing to note is that I used Python as the client in all the tests; the tests would definitely have given a different output had I used PHP. Again, the test was done to check how well the clients could make and break connections to the server, and I wanted the overall throughput after making and breaking the connections. I did not monitor the response times. I didn’t change any parameters on the servers at all, e.g. I didn’t change innodb_buffer_pool_size or key_buffer_size.

MySQL

MySQL lagged badly throughout. I monitored the MySQL server via MySQL Administrator and found that there were hardly any concurrent inserts or selects. I could see unauthenticated users, which meant that the client had connected to MySQL and was doing a handshake using MySQL authentication (username and password). As you can see, I didn’t even perform the 40 and 60 thread tests.

I truncated the table before I switched my tests from MyISAM to InnoDB, and always started the tests from the lowest thread count. My table was as follows

CREATE TABLE `comp_dump` (
  `k` char(32) DEFAULT NULL,
  `v` char(32) DEFAULT NULL,
  KEY `ix_k` (`k`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

NoSQL

For Tokyo Tyrant I used a file.tch as the DB, which is a hash database. I also tried MongoDB, as you may have noticed if you opened the worksheet, but the server kept failing; more precisely, mongod failed after hitting an unhandled exception. I found something similar over here. I tried 1.0.1, 1.1.3 and the available nightly build, but all failed and I lost my patience.

Now what

If you need speed just to fetch data for a given combination or key, Redis is a solution that you need to look at. MySQL can in no way compare to Redis and Memcache. If you find Memcache good enough, you may want to look at Tokyo Tyrant, as it does synchronous writes. But you need to check for your application which server/combination suits you best. In Marathi there is a saying “मेल्या शिवाय स्वर्ग दिसत नाही”, which means “You can’t see heaven without dying”: you need to do the hard work yourself and can’t escape that ;)

I’ve attached the source code used for the tests. If anybody has any doubts or questions, feel free to ask.

Attachment Size
throughput-get.png 8.57 KB
throughput-set.png 8.65 KB
worksheet.png 42.36 KB
comparision.tar.gz 7.46 KB

Cassandra vs HBase

We are an ad network. We need to store impressions and clicks. We were evaluating various big data (or NoSQL, whatever you choose to call it) systems for our new project. We had used HBase for the past 8 months in an experimental product and are satisfied with it, but the hype about Cassandra was so great that we decided to give it a shot. I think for some reason the Cassandra team has succeeded in marketing itself very well. You will find even non-technical people (such as VCs, CEOs, Product Managers) in Santa Monica recommending Cassandra to each other.

Our first impression of Cassandra was a good one. Their web page looks much more professional and nicer than HBase’s. It’s very easy to get it up and running. The website is well documented. Literally, it took me 5 minutes to set it up and get it running.

The real challenge was to understand the Cassandra data model and try to implement it for our use cases. It was very clear to us how we would do it with HBase, as we have pretty good experience with it. Even though Cassandra inherits the same data model from BigTable, there are some fundamental differences between Cassandra and HBase. I have tried to tabulate the differences between the two systems below:

Cassandra: Lacks the concept of a table. All the documentation will tell you that it’s not common to have multiple keyspaces. That means you have to share a keyspace in a cluster. Furthermore, adding a keyspace requires a cluster restart!
HBase: The concept of a table exists. Each table has its own keyspace. This was a big win for us. You can add and remove tables as easily as in an RDBMS.

Cassandra: Uses string keys. Very common to use UUIDs as the keys. You can use TimeUUID if you want your data to be sorted by time.
HBase: Uses binary keys. It’s common to combine three different items together to form a key. This means you can search by more than one key in a given table.

Cassandra: Even if you use TimeUUID, the hotspotting problem won’t occur, as Cassandra load balances client requests. (All the client requests going to one server in a cluster is known as a hot spot problem.)
HBase: If your key’s first component is time or a sequential number, then hotspotting occurs. All of the new keys will be inserted into one region until it fills up (thereby causing a hotspotting problem).

Cassandra: Offers sorting of columns.
HBase: Does not have sorting of columns.

Cassandra: The concept of a Supercolumn allows you to design very flexible, very complex schemas.
HBase: Does not have supercolumns. But you can design a supercolumn-like structure, as column names and values are binary.

Cassandra: Does not have any convenience method to increment a column value. In fact the very nature of eventual consistency makes it difficult to update/write a record and read it instantly after the update. You have to make sure that R + W > N to achieve strong consistency.
HBase: Consistent by design. Offers a nice convenience method to increment counters. Very suitable for data aggregation.

Cassandra: Map Reduce support is new. You will need a Hadoop cluster to run it. Data will be transferred from the Cassandra cluster to the Hadoop cluster. Not suitable for running large-data map reduce jobs.
HBase: Map Reduce support is native. HBase is built on Hadoop. Data does not get transferred.

Cassandra: Comparatively simpler to maintain if you don’t have to have Hadoop.
HBase: Comparatively complicated, as it has many moving pieces such as Zookeeper, Hadoop and HBase itself.

Cassandra: Does not have a native Java API as of now. No javadoc. Even though it is written in Java, you have to use Thrift to communicate with the cluster.
HBase: Has a nice native Java API. Feels much more like a Java system than Cassandra. Being a Java shop, this was important for us. HBase has a Thrift interface for other languages too.

Cassandra: No master server, hence no single point of failure.
HBase: Although there exists a concept of a master server, HBase itself does not depend on it heavily. An HBase cluster can keep serving data even if the master goes down. The Hadoop namenode is a single point of failure.
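
The R + W > N rule mentioned in the comparison can be checked by brute force. The Python sketch below is only an illustration (the function name `quorums_always_overlap` is made up): it enumerates every possible set of R read replicas and W write replicas out of N and tests whether each pair shares at least one node, which is what guarantees a read sees the latest write.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read quorum of size r shares a node with every write quorum of size w."""
    nodes = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(nodes, r)
        for write_q in combinations(nodes, w)
    )


print(quorums_always_overlap(3, 2, 2))  # R + W > N
print(quorums_always_overlap(3, 1, 1))  # R + W <= N
```

With N=3, R=2, W=2 every read overlaps every write; with R=1, W=1 a read can land on a replica the write never touched.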

After comparing the data model and features in this way, HBase was the clear winner for us. In my opinion, if consistency is what you need, HBase is a clear choice. Furthermore, native map reduce, the concept of a table, and a simpler schema that can be modified without a cluster restart are big pluses you cannot ignore. HBase is a much more mature platform. When people cite Twitter and Facebook using Cassandra, they forget that the same organizations are using HBase too. In fact one of the committers of HBase was recently hired by Facebook, which clearly shows their interest in HBase.

So in short, we will root for team HBase!

Motivation

At Medallia, a key component of our system currently works with an open source relational db. Since this component mainly queries the db entries by key, we want to try to switch to a key-value storage system and take advantage of several benefits provided by such a system, including distributed replication, load balancing, and failover. One of our objectives is to re-architect this component in a way that will allow us to achieve horizontal scalability, that among other things will help us alleviate the high disk storage requirements we currently have.

Recently we took the time to look into this (and other technological improvements too, exciting times at Medallia right now!), and we reviewed several options. To make a long story short, we ended up with two finalists, Apache Cassandra and Project Voldemort.

These two projects seem to be the most mature open source options in their class, and both provide native decentralized clustering support including partitioning, fault tolerance, and high availability. Both are based on Amazon’s Dynamo paper, but the main difference is that Voldemort follows a simple key/value model, while Cassandra uses a persistency model based on BigTable’s column-oriented model. Both provide support for read-consistency where read operations always return the latest data, which was a requirement for us.

High level comparison

While not an exhaustive list, these are the most relevant pros and cons we identified when reviewing both stores:

Project Voldemort

  • Pros
    • Simpler API
    • Persistency based on Berkeley DB, a mature and well-known key/value db
    • Uses Vector Clocks instead of simple timestamps. It doesn’t need the nodes’ (or clients’) clocks to be synchronized
  • Cons
    • No built-in support for “multiple data center”-aware routing (meaning there must be 1 copy of each key in at least one data center)

Apache Cassandra

  • Pros
    • Broader range of systems in production (Facebook, Twitter, Digg, Rackspace)
    • Richer API which supports values with a dynamic column structure. The columns can evolve independently, meaning that you can update one column without reading the whole structure.
    • Optimized for writes (by design)
    • Configurable consistency level (specified on each request)
  • Cons
    • File format is still in development, so changes to the internal structure are likely to happen. Due to the flexibility it provides, the file format is more complex and harder to reason about, especially in terms of performance
    • Requires clock synchronization (NTP) for nodes and clients
    • Reads are more disk-intensive than competitors
    • Doesn’t support client conflict resolution, so the latest update always wins

Performance Tests

To our surprise this was the only link we’ve found that compares the performance for both projects – thus we decided to write this post to share our research. We used the vpork test framework, which we modified to suit our needs by upgrading the client code to the latest versions, adding a warm-up phase, and adding rewrite capabilities. These are the results of our tests:

Setup:

  • Versions
    • Voldemort v0.80.1
    • Cassandra 0.6.0-beta3
  • Boxes: 3 similar nodes with the following spec:
    • 4GB maximum heap size
    • Replication parameters: N=3 (replicas for each entry), R=2 (nodes to wait for on each read), W=2 (nodes to block for on each write)
    • 8 processors on each server (Intel(R) Xeon(R) CPU E5504 @ 2.00GHz)
    • 1TB disk space (Seagate ST31000340NS, 7200 RPM, 32MB cache)
  • Persistence parameters
    • Voldemort (default values)
      • key-serializer: string
      • value-serializer: identity (byte array)
      • persistence engine=bdb (Berkeley DB)
      • bdb.cache.size=1536MB
    • Cassandra
      • ColumnFamily definition: CompareWith="BytesType" RowsCached="10000"
      • ReplicationFactor=3
      • Partitioner=org.apache.cassandra.dht.RandomPartitioner
      • ConcurrentReads=16
      • ConcurrentWrites=32
  • Tests
    • Client Threads: 40
    • Initial load: 5 million records – records present before starting each test
    • WarmUp: 20K records – initial writes before measuring time for each test
    • Number of operations per test: 500K

We ran tests for 4 different write-rewrite-read configurations. A write is equivalent to a put operation with a new record (non-existing key). A rewrite is a put operation with an existing key. A read is a get operation on an existing key. These are the configurations we tested:

  • 50% Write 50% Read
  • 10% Write 40% Rewrite 50% Read
  • 50% Rewrite 50% Read
  • 90% Rewrite 10% Read
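
A rough idea of how such operation mixes can be generated is sketched below in Python. This is not the vpork code: `MIXES` and `plan_operations` are names invented for this sketch, and it simply draws each operation at random according to the percentages listed above.

```python
import random

# The four write/rewrite/read mixes from the list above, as percentages.
MIXES = {
    "50% Write 50% Read": {"write": 50, "rewrite": 0, "read": 50},
    "10% Write 40% Rewrite 50% Read": {"write": 10, "rewrite": 40, "read": 50},
    "50% Rewrite 50% Read": {"write": 0, "rewrite": 50, "read": 50},
    "90% Rewrite 10% Read": {"write": 0, "rewrite": 90, "read": 10},
}


def plan_operations(mix, total_ops, seed=0):
    """Return a list of total_ops operation names drawn according to the mix."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    kinds = list(mix)
    weights = [mix[kind] for kind in kinds]
    return rng.choices(kinds, weights=weights, k=total_ops)


ops = plan_operations(MIXES["90% Rewrite 10% Read"], 1000)
print({kind: ops.count(kind) for kind in ("write", "rewrite", "read")})
```

A test client would then map "write" to a put with a new key, "rewrite" to a put with an existing key, and "read" to a get on an existing key.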

We ran all the tests for two different value sizes, 15 and 1.5 KB. Even though we evaluated different options, for our current needs, the last one with a 15 KB data entry was the most representative scenario.

The first pair of charts shows the latency, or average time it takes a read or write operation to complete in each case. Lower values are better. As expected, Cassandra write (and rewrite) times were consistently faster than Voldemort, while read times varied a bit depending on the scenario but were more or less the same in general.

test-avg-lat-1.5.png

test-avg-lat-15.png

The second pair of charts shows the maximum time in the best 99% of cases; again lower values are better:

test-99-lat-1.5.png

test-99-lat-15.png

On the front-end, we have a write-back cache which means that write operations don’t affect the user experience. On the other hand, read operations are directly related to page loads. That’s why we were concerned about the peak for Cassandra read in the last scenario for 15KB. We ran some further tests to measure the 99.9% and 99.99% percentiles and the difference was even greater: 5050 ms for Cassandra and 748 ms for Voldemort in the first case, and 9176 ms against 1129 ms in the second case. This huge difference was a key decision factor for us.
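
For readers who want to compute the same kind of figures from their own latency samples, here is a minimal Python sketch using the nearest-rank definition of a percentile. We are not claiming this is how vpork computes its numbers; the `percentile` helper and the sample data are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples or not 0 < p <= 100:
        raise ValueError("need samples and 0 < p <= 100")
    ordered = sorted(samples)
    # round() guards against floating-point noise such as 999.0000000001
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[max(rank, 1) - 1]


latencies_ms = list(range(1, 1001))  # made-up samples: 1 ms .. 1000 ms
for p in (99, 99.9, 100):
    print(p, percentile(latencies_ms, p))
```

The higher the percentile, the more the result is dominated by the few slowest requests, which is why the 99.9% and 99.99% figures diverge so much more than the averages.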

Finally, these two charts show the general throughput in terms of operations (read or write) per second. In this case higher values are better:

test-ops-1.5.png
test-ops-15.png

Notes:

  • The Cassandra commit log and data folder are supposed to be placed on different disks to improve performance; we tested with both on the same disk.

Issues found while testing:

  • Voldemort client.put(K key, V value) (not the one that takes a Version object) throws ObsoleteVersionException if called with the same key from different threads. The javadoc states “Associated (sic) the given value to the key, clobbering any existing values stored for the key. “, so this was not expected.

And the winner is …

I think there is no clear winner, in general terms. The best option depends on many factors that each company has to evaluate. My preference changed a few times during the review and tests.

Having said that, we had to choose one, and we decided to go with Project Voldemort. The main reasons were the simplicity, better versioning control, persistency layer maturity, and latency predictability.

We are currently developing the new solution, and it will take some time before we can put it in production, but we wanted to share our preliminary results with everyone who is considering one of these two options, so they’ll have one more tool at the time of the decision.

We’ll keep you posted on how it goes.

Diego Erdody
Lead Software Engineer

My previous post was a Redis, Memcache, Tokyo Tyrant, MySQL comparison. In it, MySQL was spending a huge amount of time doing reverse DNS lookups.

I turned on the skip-name-resolve parameter in my.cnf and the throughput of MySQL grew considerably, roughly doubling.

Here are the new results.

GET

SET

worksheet

MyISAM vs InnoDB

Nothing much has changed in the above test, except that InnoDB starts leading the way when there is a high number of concurrent inserts/updates or writes on the table. As seen from the “Set” graph, InnoDB starts closing in on MyISAM’s write efficiency at around 30 concurrent requests, and by 60 concurrent requests it’s already ahead in write throughput: 1284/s against 825/s. Further, I had put a watch on the processlist and was watching the processes; there were times during the MyISAM tests when inserts took over 6 seconds to finish, which means that if you need an application that responds quickly under heavy load / heavy concurrency, you need to check the MyISAM vs. InnoDB scenario really closely. At low concurrency MyISAM is well ahead in writes, and in reads both MyISAM and InnoDB perform equally well.

Again, you need to make sure that you check your test conditions really well before just taking InnoDB for granted.

Attachment Size
throughput-get2.png 7.96 KB
throughput-set2.png 8.71 KB
worksheet-2.png 23.31 KB
comparision.ods 29.02 KB

BigTable Model with Cassandra and HBase

Recently, in a number of “scalability discussion meetings”, I’ve seen the following pattern coming up repeatedly …

  1. To make your app scalable, you try to make your app layer “stateless”.
  2. OK, so you move the “state” out of your application layer to a shared DB, or shared data layer.
  3. Now, how do we make the data tier scalable? By definition, we cannot make the data tier stateless.
  4. OK, now let’s think about how to “partition” your data and spread it across multiple machines in such a way that the workload is balanced.
  5. Now there are more boxes, and what if some of them crash?
  6. OK, we should replicate the data across machines.
  7. Now, how do we keep these data in sync …
  8. And then Cloud computing gets into the picture (as it always does). Now we are not just having a pool of machines but also the pool size can grow and shrink according to workload fluctuation (you don’t want to pay for something idle, right ?).
  9. Now we need to figure out as we add more machines into the pool or remove machine from the pool, how we should “redistribute” the data.

This is an area where NOSQL shines. In the last 18 months, NOSQL has become one of the hottest topics in the software industry. It has been introduced as a solution to the large-scale data storage problem in the range of terabytes or petabytes. Dozens of NOSQL products have come to the market, but two leaders, HBase and Cassandra, seem to stand out from the rest in terms of adoption.

Given the increasing demand for explanations of these two products recently, I decided to write a post on them.

Rather than repeat the basic theory of NOSQL here, I refer you to my earlier blog post for the distributed system theory underlying the NOSQL design.

Both HBase and Cassandra are based on the Google BigTable model, so let’s first introduce some key characteristics of BigTable.

Fundamentally Distributed
BigTable is built from the ground up on a “highly distributed”, “share nothing” architecture. Data is meant to be stored on a large number of unreliable, commodity server boxes through “partitioning” and “replication”. Data partitioning means the data is partitioned by its key and stored on different servers. Replication means the same data element is replicated multiple times on different servers.


Column Oriented
Unlike a traditional RDBMS implementation, where each “row” is stored contiguously on disk, BigTable stores each column contiguously on disk. The underlying assumption is that in most cases not all columns are needed for a data access, so a column-oriented layout lets more records sit in a disk block and hence can reduce disk I/O.

Column-oriented layout is also very effective for storing very sparse data (many cells have a NULL value) as well as multi-value cells. The following diagram illustrates the difference between a Row-oriented layout and a Column-oriented layout


Variable number of Columns
In an RDBMS, each row must have a fixed set of columns defined by the table schema, and therefore it is not easy to support columns with multi-value attributes. The BigTable model introduces the “Column Family” concept such that a row has a fixed number of “column families”, but within a “column family” a row can have a variable number of columns, which can differ from row to row.


In the BigTable model, the basic data storage unit is a cell (addressed by a particular row and column). BigTable allows multiple timestamped versions of data within a cell. In other words, the user can address a data element by the rowid, column name and timestamp. At the configuration level, BigTable allows the user to specify how many versions can be stored within each cell, either by count (how many) or by freshness (how old).
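
As a toy illustration of cells addressed by row, column and timestamp, here is a small Python sketch. It is not how BigTable, HBase or Cassandra store data internally; the `VersionedTable` class is invented for this example and only implements the "by count" retention policy.

```python
from collections import defaultdict


class VersionedTable:
    """Toy cell store: each (row, column) cell keeps its newest max_versions values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, timestamp, value):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the count limit

    def get(self, row, column, timestamp=None):
        """Newest value at or before timestamp (newest overall if timestamp is None)."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = VersionedTable(max_versions=2)
table.put("user1", "email", 1, "a@example.com")
table.put("user1", "email", 2, "b@example.com")
table.put("user1", "email", 3, "c@example.com")
print(table.get("user1", "email"))     # newest version
print(table.get("user1", "email", 2))  # version as of timestamp 2
print(table.get("user1", "email", 1))  # evicted, only two versions are kept
```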

At the physical level, BigTable stores each column family contiguously on disk (imagine one file per column family), and physically sorts the data by rowid, column name and timestamp. The sorted data is then compressed so that a disk block can hold more data. Since data within a column family usually has a similar pattern, compression can be very effective.

Sequential write
The BigTable model is highly optimized for write operations (insert/update/delete) using sequential writes (no disk seek is needed). Basically, a write happens by first appending a transaction entry to a log file (hence the disk write I/O is sequential with no disk seek), followed by writing the data into an in-memory Memtable. If the machine crashes and all in-memory state is lost, the recovery step brings the Memtable up to date by replaying the updates in the log file.

All the latest updates are therefore stored in the Memtable, which grows until it reaches a size threshold; the Memtable is then flushed to disk as an SSTable (sorted by the string key). Over time there will be multiple SSTables on disk that store the data.
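
The write path just described can be sketched in a few lines of Python. This is a toy model under simplifying assumptions: the log and SSTables are in-memory lists standing in for files, the flush threshold counts keys rather than bytes, and `TinyStore` is a name invented for this example.

```python
class TinyStore:
    """Toy write path: append to a log, update the Memtable, flush when it is full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.log = []        # stands in for the append-only log file
        self.memtable = {}   # in-memory key -> value
        self.sstables = []   # each SSTable is a list of (key, value) sorted by key

    def put(self, key, value):
        self.log.append((key, value))  # sequential append comes first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the Memtable out as a sorted SSTable and start afresh."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.log = []  # flushed entries no longer need replaying

    def recover(self):
        """Rebuild the Memtable after a crash by replaying the log."""
        self.memtable = {}
        for key, value in self.log:
            self.memtable[key] = value


store = TinyStore(flush_threshold=3)
store.put("b", 1)
store.put("a", 2)
store.memtable = {}  # simulate a crash that wipes in-memory state
store.recover()
print(store.memtable)
```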

Merged read
Whenever a read request is received, the system first looks up the Memtable by row key to see if it contains the data. If not, it looks at the on-disk SSTables to see if the row key is there. We call this the “merged read” as the system needs to look at multiple places for the data. To speed up the detection, each SSTable has a companion Bloom filter that can rapidly detect the absence of the row key. In other words, only when the Bloom filter returns positive will the system do a detailed lookup within the SSTable.
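
Here is a minimal Python sketch of a merged read guarded by Bloom filters. The filter parameters (1024 bits, 3 hashes), the use of SHA-256 to derive bit positions, and the names `BloomFilter` and `merged_read` are all assumptions made for illustration, and each SSTable is modelled as a plain dict.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def merged_read(key, memtable, sstables):
    """Check the Memtable first, then each (bloom, data) SSTable, newest first."""
    if key in memtable:
        return memtable[key]
    for bloom, data in sstables:
        # Only do the detailed lookup when the filter says the key may be present.
        if bloom.might_contain(key) and key in data:
            return data[key]
    return None
```

A lookup for a key that was never written usually skips every SSTable, because each filter answers "definitely absent" without touching the data.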

Periodic Data Compaction
As you can imagine, reads can be quite inefficient when there are too many SSTables scattered around. Therefore, the system periodically merges the SSTables. Notice that since each SSTable is individually sorted by key, a simple “merge sort” is sufficient to merge multiple SSTables into one. The merge mechanism is based on a logarithmic property where two SSTables of the same size are merged into a single SSTable of double the size. Therefore the number of SSTables is O(log N), where N is the number of rows.
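
Because every SSTable is already sorted, the merge step needs nothing more than a streaming merge. The Python sketch below uses the standard library's heapq.merge; the `compact` function and the rule that the newest SSTable wins on duplicate keys are assumptions made for this illustration, not a description of either product's compaction code.

```python
import heapq


def _tag(run, age):
    """Attach the run's age so that, for equal keys, newer entries sort first."""
    return [(key, age, value) for key, value in run]


def compact(sstables):
    """Merge sorted (key, value) runs, given newest first, into one sorted run."""
    merged = heapq.merge(*(_tag(run, age) for age, run in enumerate(sstables)))
    result = []
    for key, _age, value in merged:
        if not result or result[-1][0] != key:
            result.append((key, value))  # first occurrence is the newest version
    return result


newest = [("a", 2), ("c", 9)]
older = [("a", 1), ("b", 5)]
print(compact([newest, older]))
```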

Having looked at the common part, let’s look at the differences between HBase and Cassandra.

HBase
Based on BigTable, HBase uses the Hadoop Filesystem (HDFS) as its data storage engine. The advantage of this approach is that HBase doesn’t need to worry about data replication, data consistency and resiliency because HDFS already handles them. Of course, the downside is that it is also constrained by the characteristics of HDFS, which is not optimized for random read access. In addition, there will be extra network latency between the DB server and the file server (which is the Hadoop data node).


In the HBase architecture, data is stored in a farm of Region Servers. A “key-to-server” mapping is needed to locate the corresponding server, and this mapping is stored as a “Table” similar to other user data tables.

Before a client does any DB operation, it needs to first locate the corresponding region server.

  1. The client contacts a predefined Master server, which replies with the endpoint of a region server that holds a “Root Region” table.
  2. The client contacts that region server, which replies with the endpoint of a second region server that holds a “Meta Region” table, which contains a mapping from “user table” to “region server”.
  3. The client contacts this second region server, passing along the user table name. This second region server looks up its meta region and replies with the endpoint of a third region server that holds a “User Region”, which contains a mapping from “key range” to “region server”.
  4. The client contacts this third region server, passing along the row key that it wants to look up. This third region server looks up its user region and replies with the endpoint of a fourth region server that holds the data the client is looking for.
  5. The client caches the results along the way so that subsequent requests don’t need to go through this multi-step process again to resolve the corresponding endpoint.

In HBase, the in-memory data storage (what we referred to as “Memtable” above) is implemented in Memcache. The on-disk data storage (what we referred to as “SSTable” above) is implemented as an HDFS file residing on a Hadoop data node server. The log file is also stored as an HDFS file. (I feel that storing a transaction log file remotely will hurt performance.)

Also in the HBase architecture, there is a special machine playing the “role of master”, which monitors and coordinates the activities of all region servers (the heavy-duty worker nodes). To the best of my knowledge, the master node is a single point of failure at this moment.

For a more detailed architecture description, Lars George has a very good explanation of the log file implementation as well as the data storage architecture of HBase.

Cassandra
Also based on the BigTable model, Cassandra uses the DHT (distributed hash table) model to partition its data, as described in the Amazon Dynamo paper.

Consistent Hashing via O(1) DHT
Each machine (node) is associated with a particular id that is distributed in a keyspace (e.g. 128 bit). Every data element is also associated with a key (in the same key space). A server owns all the data whose key lies between its id and the preceding server’s id.
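
The ownership rule can be sketched as below. This is an illustrative Python model only: using MD5 to place names on a 128-bit ring and the node names themselves are assumptions for the example, not Cassandra’s actual partitioner code.

```python
import bisect
import hashlib


def token(name, bits=128):
    """Map a string onto the ring; MD5 is used only because it yields 128 bits."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % (1 << bits)


class Ring:
    def __init__(self, node_names):
        # Sort nodes by their id so the ring can be searched with bisect.
        self.nodes = sorted((token(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]

    def owner(self, key):
        """First node whose id is >= the key's token, wrapping around the ring."""
        index = bisect.bisect_left(self.ids, token(key)) % len(self.nodes)
        return self.nodes[index][1]
```

A key whose token lies past the highest node id wraps around to the node with the lowest id, which is what makes the keyspace a ring.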

Data is also replicated across multiple servers. Cassandra offers multiple replication schemes, including storing the replicas on neighbor servers (whose ids succeed that of the server owning the data), or a rack-aware strategy that places the replicas according to physical location. The simple partition strategy is as follows …


Tunable Consistency Level
Unlike HBase, Cassandra allows you to choose the consistency level that is suitable for your application, so you can gain more scalability if you are willing to trade off some data consistency.

For example, it allows you to choose how many ACKs must be received from different replicas before a WRITE is considered successful. Similarly, you can choose how many replica responses must be received for a READ before the result is returned to the client.

By choosing appropriate numbers for W and R, you can choose the level of consistency you like. For example, to achieve Strict Consistency, we just need to pick W and R such that W + R > N, where N is the number of replicas. This includes the possibilities (W = one and R = all), (R = one and W = all), and (W = quorum and R = quorum). Of course, if you don’t need strict consistency, you can choose smaller values for W and R and gain higher availability. Regardless of which consistency level you choose, the data will be eventually consistent thanks to the “hinted handoff”, “read repair” and “anti-entropy sync” mechanisms described below.
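
The W + R > N rule is easy to check mechanically. The small Python sketch below uses hypothetical helper names and simply restates the inequality; it is not part of any Cassandra API.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def is_strictly_consistent(n, w, r):
    """True when every read set must overlap every write set (W + R > N)."""
    return w + r > n


# With N = 3 replicas: one/all, all/one and quorum/quorum all overlap,
# while one/one does not.
n = 3
combos = {
    "W=one, R=all": is_strictly_consistent(n, 1, n),
    "W=all, R=one": is_strictly_consistent(n, n, 1),
    "W=quorum, R=quorum": is_strictly_consistent(n, quorum(n), quorum(n)),
    "W=one, R=one": is_strictly_consistent(n, 1, 1),
}
```

The overlap matters because at least one replica answering the read must have seen the latest acknowledged write.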

Hinted Handoff
The client performs a write by sending the request to any Cassandra node, which acts as a proxy for the client. This proxy node locates the N nodes that hold the data replicas and forwards the write request to all of them. If any of those nodes has failed, the proxy picks a random node as a handoff node and writes the request to it with a hint telling it to forward the write request to the failed node after it recovers. The handoff node then periodically checks for the recovery of the failed node and forwards the write to it. Therefore, the original node will eventually receive all the write requests.

Conflict Resolution
Since writes can reach different replicas, the timestamp of the data is used to resolve conflicts. In other words, the latest timestamp wins and the data with earlier timestamps is pushed into an earlier version (it is not lost).

Read Repair
When the client performs a “read”, the proxy node issues N reads but only waits for R responses and returns the one with the latest version. If some nodes respond with an older version, the proxy node sends the latest version to them asynchronously, so these left-behind nodes will still eventually catch up with the latest version.
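
A simplified model of read repair with “latest timestamp wins” might look like this in Python. It assumes each replica is a dictionary mapping a key to a (timestamp, value) pair, treats the first R replicas as the ones that answered in time, and performs the repair synchronously; the real system does the repair asynchronously.

```python
def read_with_repair(replicas, key, r):
    """Return the newest value seen among the first r replicas and repair stale ones."""
    waited = [rep[key] for rep in replicas[:r] if key in rep]
    if not waited:
        return None
    # Latest timestamp wins.
    latest = max(waited, key=lambda version: version[0])
    for rep in replicas:
        current = rep.get(key)
        # Push the latest version to replicas that are missing it or lag behind.
        if current is None or current[0] < latest[0]:
            rep[key] = latest
    return latest[1]
```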

Anti-Entropy data sync
To ensure the data stays in sync even when no READ or WRITE touches it, replica nodes periodically gossip with each other to figure out whether anyone is out of sync. For each key range of data, each member of the replica group computes a Merkle tree (a hash tree in which differences can be located quickly) and sends it to its neighbors. By comparing the received Merkle tree with its own tree, each member can quickly determine which portion of the data is out of sync. If there is a difference, it sends the diff to the left-behind members.
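
To illustrate why a hash tree locates differences quickly, here is a rough Python sketch. It assumes both replicas split the key range into the same number of leaf chunks and represents each chunk as a string; the function names are made up for the example.

```python
import hashlib


def h(data):
    return hashlib.sha256(data.encode()).hexdigest()


def build_tree(leaves):
    """Return a list of levels, from the leaf hashes up to the single root hash."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([
            h(prev[i] + (prev[i + 1] if i + 1 < len(prev) else ""))
            for i in range(0, len(prev), 2)
        ])
    return levels


def differing_leaves(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    suspects = [0]
    for depth in range(len(tree_a) - 1, 0, -1):
        next_suspects = []
        for i in suspects:
            if tree_a[depth][i] != tree_b[depth][i]:
                for child in (2 * i, 2 * i + 1):
                    if child < len(tree_a[depth - 1]):
                        next_suspects.append(child)
        suspects = next_suspects
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]
```

If the two root hashes match, nothing below them is compared at all, so in-sync replicas pay almost nothing.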

Anti-entropy is the “catch-all” way to guarantee eventual consistency, but is also pretty expensive and therefore is not done frequently. By combining the data sync with read repair and hinted handoff, we can keep the replicas pretty up-to-date.

BigTable trade offs
To retain the scalability features of BigTable, some of the basic features that an RDBMS provides are missing from the BigTable model. Here we highlight the rough edges of BigTable.

1) Primitive transaction support
Transaction protection is only guaranteed within a single row. In other words, you cannot start an atomic transaction that modifies multiple rows.

2) Primitive isolation support
While you are reading a row, other people may modify and update the same row before you do. Your view is then no longer current, but your later update can easily wipe out other people’s changes.

There are many techniques for isolating concurrent updates, including pessimistic approaches such as locking and optimistic approaches that use a vector clock as the version stamp. But to the best of my understanding, there is no robust test-and-set operation in the BigTable model (there is some getLock mechanism in HBase which I haven’t looked into). My impression is that there is no easy way to check that no concurrent update has happened in between.

Because of this limitation, I think BigTable model is more suitable for those applications where concurrent update to the same row is very rare, or some inconsistency is tolerable at the application level. Fortunately, there are still a lot of applications falling into this bucket.

3) No indexes
Notice that data within BigTable is all physically sorted by rowid, column name and timestamp. There is no index from a column value to its containing rowid.

This model is quite different from RDBMS where you typically define a table and worry about defining the index later. There is no such “index” concept in BigTable and you need to carefully plan out the physical sorting order of your data layout.

The lack of indexes turns out to be quite inconvenient, and many people using BigTable end up building their own indexes at the application level. This usually results in a highly denormalized data model with lots of column families that store links to other tables. Any update to the base data needs to carefully update these other column families as well. From a performance angle, this is actually better than maintaining an index in an RDBMS because BigTable is optimized for writes. However, since it is now the application logic that maintains the index, this can be a source of application bugs.

4) No referential integrity enforcement
As mentioned above, since you are building artificial indexes at the application level, you need to maintain the integrity of your indexes as well. This includes updating your index when the base data is inserted, modified or deleted. This kind of handling logic traditionally resides at the RDBMS level, but since BigTable has no such referential integrity concept, the responsibility now lands on your application logic.

5) Lack of surrounding tools
As NOSQL and BigTable are very new, the tools surrounding them are definitely not comparable to those of the RDBMS world at this moment. Such tools include report generation, BI, data warehousing, etc.

I observe a general trend of NOSQL products moving towards providing an ODBC / JDBC interface so that they integrate more easily with the existing tool market. But at this moment, to the best of my understanding, such interfaces are not widespread yet.

Design Patterns for Bigtable model
Due to the very different model of Bigtable, the data model design methodology is also quite different from traditional RDBMS schema design. Here is a sequence of steps that are pretty common …

1) Identify all your query scenarios
Since there is no index concept, you have to plan out carefully how your data is physically sorted. Therefore it is important to find all your query use cases first.

2) Define your “entity table” and its corresponding column families
For an entity table, it is pretty common to have one column family storing all the entity attributes, and multiple column families to store the links to other entities. (e.g. A “UserTable” may contain a column family “baseInfo” to store all attributes of the user, a column family “friend” to store the links to another user, a column family “company” to store links to another CompanyTable)

3) Define your “index table”
The “index table” is what your application builds to support reverse lookup. The “key” is typically based on the search criteria you have identified in your query scenarios. It is not uncommon for each query to have its own specific index table.

4) Make sure your application logic updates the index correctly
Since the index table has to be maintained by application logic, you need to check to make sure it is done correctly. In many cases, this can be quite a source of bugs.
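
As a rough illustration of steps 2 to 4, the Python sketch below keeps an entity table and an application-maintained index table in step. The table names, the “email” attribute and the use of in-memory dictionaries are all assumptions for the example; a real application would issue the equivalent writes against its store.

```python
class UserStore:
    """Entity table plus a hand-maintained index table for reverse lookup."""

    def __init__(self):
        self.user_table = {}    # row key -> entity attributes
        self.email_index = {}   # index table: email -> row key

    def put(self, user_id, email):
        old = self.user_table.get(user_id)
        if old is not None and old["email"] != email:
            # The application itself must remove the stale index entry.
            self.email_index.pop(old["email"], None)
        self.user_table[user_id] = {"email": email}
        self.email_index[email] = user_id

    def delete(self, user_id):
        old = self.user_table.pop(user_id, None)
        if old is not None:
            self.email_index.pop(old["email"], None)

    def find_by_email(self, email):
        return self.email_index.get(email)
```

Forgetting the cleanup in `put` or `delete` is exactly the kind of bug that an RDBMS-maintained index would have prevented.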

It is important to realize that NOSQL is not advocating a replacement of the RDBMS, which has been proven in many lines of application. NOSQL should be considered a complementary technology for some niche areas that the RDBMS does not cover well.

Fronting Tomcat with Apache or IIS

Summary

Running a cluster of Tomcat servers behind a Web server can be a demanding
task if you wish to achieve maximum performance and stability.
This article describes best practices for accomplishing that.

By Mladen Turk

Fronting Tomcat

One might ask: why put a Web server in front of Tomcat
at all? Thanks to the latest advances in Java Virtual Machine (JVM)
technology and the Tomcat core itself, standalone Tomcat is quite
comparable in performance to native web servers.
Even when delivering static content it is only 10%
slower than recent Apache 2 web servers.

The answer is: scalability.

Tomcat can serve many concurrent users by assigning a separate thread of
execution to each concurrent client connection. It can do that nicely, but
there is a problem when the number of those concurrent connections rises.
The time the Operating System spends on managing those threads will degrade
the overall performance. The JVM will spend more time managing and switching
those threads than doing the real job, serving the requests.

Besides the connectivity there is one more significant problem, and it is
caused by the applications running on Tomcat. A typical application will process
client data, access the database, do some calculations and present the data
back to the client. All that can be a time consuming job that in most cases
must be finished inside half a second, to achieve user perception of a working
application. Simple math will show that for a 10ms application response time you
will be able to serve at most 50 concurrent users, before your users start
complaining. So what to do if you need to support more users?
The simplest thing is to buy faster hardware, add more CPUs or add more boxes.
Two 2-way boxes are usually cheaper than one 4-way box, so adding more boxes
is generally a cheaper solution than buying a mainframe.

The first thing to do to ease the load on Tomcat is to use the Web server
for serving static content such as images.

Figure 1. Generic configuration

Figure 1 shows the simplest possible configuration scenario. Here the
Web server is used to deliver static content while Tomcat only does the
real job, serving the application. In most cases this is all that you will need.
With a 4-way box and 10ms application time you’ll be capable of serving 200
concurrent users, thus giving 3.5 million hits per day, which is by all
means a respectable number.

For that kind of load you generally do not need a Web server in front of
Tomcat. But here comes the second reason to put a Web server in front, and
that is creating a DMZ (demilitarized zone). Putting the Web server on a
host inserted as a “neutral zone” between a company’s private network
and the internet or some other outside public network gives the applications
hosted on Tomcat the ability to access company private data, while securing
access to other private resources.

Figure 2. Secure generic configuration

Besides having a DMZ and secure access to a private network there can
be many other factors, such as the need for custom authentication.

If you need to handle more load you will eventually have to add more Tomcat
application servers. The reason can be either that your client load
simply cannot be handled by a single box or that you
need some sort of failover in case one of the nodes breaks.

Figure 3. Load balancing configuration

A configuration containing multiple Tomcat application servers needs a load
balancer between the web server and Tomcat. For Apache 1.3, Apache 2.0 and IIS
Web servers you can use the Jakarta Tomcat Connector (also known as JK), because
it offers both software load balancing and sticky sessions. For the upcoming
Apache 2.1/2.2 use the advanced mod_proxy_balancer, a new module designed and
integrated within the Apache httpd core.


Calculating Load

When determining the number of Tomcat servers that you will need to satisfy
the client load, the first and major task is determining the Average Application
Response Time (hereafter AART). As said before, to satisfy the user experience
the application has to respond within half a second. The content received by the
client browser usually triggers a couple of physical requests to the Web server
(e.g. images). A web page usually consists of html and image data, so the client
issues a series of requests, and the time it takes for all of this to be
processed and delivered is called AART. To get the most out of Tomcat you should
limit the number of concurrent requests to 200 per CPU.

So we can come up with a simple formula to calculate the maximum
number of concurrent connections a physical box can handle:

                              500
    Concurrent requests = ( ---------- max 200 ) * Number of CPU's
                            AART (ms)
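
The formula can be written out as a small Python helper. This is only a restatement of the formula above: the function and parameter names are made up for the example, and “max 200” is read as a cap of 200 requests per CPU.

```python
def max_concurrent_requests(aart_ms, cpus, budget_ms=500, per_cpu_cap=200):
    """(500 / AART, capped at 200) multiplied by the number of CPUs."""
    per_cpu = min(budget_ms / aart_ms, per_cpu_cap)
    return int(per_cpu * cpus)


one_cpu = max_concurrent_requests(aart_ms=10, cpus=1)    # 50
four_way = max_concurrent_requests(aart_ms=10, cpus=4)   # 200
capped = max_concurrent_requests(aart_ms=1, cpus=4)      # 800, the cap applies
```

The first two values match the 50 and 200 concurrent users quoted earlier for a 10ms application.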

The other thing that you must take care of is the network throughput between
the Web server and the Tomcat instances. This introduces a new variable called
Average Application Response Size (hereafter AARS), which is the number of
bytes of all content on a web page presented to the user. On a standard
100Mbps network card with 8 bits per byte, the maximum theoretical
throughput is 12.5 MBytes per second.

                               12500
    Concurrent requests = ---------------
                            AARS (KBytes)

For a 20KB AARS this will give a theoretical maximum of 625 concurrent
requests. You can add more cards or use faster 1Gbps hardware if you need
to handle more load.
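
The network bound can be sketched the same way. The helper name and the `link_mbps` parameter are assumptions for the example; the arithmetic follows the formula above.

```python
def max_requests_by_network(aars_kbytes, link_mbps=100):
    """Theoretical number of responses per second that fit on the link."""
    kbytes_per_second = link_mbps * 1000 / 8   # 100 Mbps -> 12500 KBytes/s
    return int(kbytes_per_second / aars_kbytes)


fast_ethernet = max_requests_by_network(aars_kbytes=20)                 # 625
gigabit = max_requests_by_network(aars_kbytes=20, link_mbps=1000)       # 6250
```

The smaller of this number and the CPU-based number is the one that limits a box.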

The formulas above will give you rudimentary estimation of the number of
Tomcat boxes and CPU’s that you will need to handle the desired
number of concurrent client requests.
If you have to deploy the configuration without
having actual hardware, the closest you can get is to measure the AART on
a test platform and then compare the hardware vendor Specmarks.


Fronting Tomcat with Apache

If you need to put Apache in front of Tomcat, use Apache2 with the
worker MPM. You can use Apache1.3 or Apache2 with the prefork MPM for handling
simple configurations like the one shown in Figure 1. If you need to front
several Tomcat boxes and implement load balancing, use Apache2 with the worker
MPM compiled in.

MPM, or Multi-Processing Module, is an Apache2 core feature and it is responsible
for binding to network ports on the machine, accepting requests,
and dispatching children to handle the requests.
MPMs must be chosen during configuration, and compiled into the server.
Compilers are capable of optimizing a lot of functions if threads are used,
but only if they know that threads are being used. Because some MPMs use threads
on Unix and others don’t, Apache will always perform better if the MPM is
chosen at configuration time and built into Apache.

Worker MPM offers a higher scalability compared to a standard prefork
mechanism where each client connection creates a separate Apache process.
It combines the best from two worlds, having a set of child processes each
having a set of separate threads. There are sites that are running
10K+ concurrent connections using this technology.

Connecting to Tomcat

In the simplest scenario, when you need to connect to a single Tomcat instance,
you can use mod_proxy, which comes as a part of every Apache distribution.
However, using the mod_jk connector will provide approximately double the
performance. There are several reasons for that, and the major one is that
mod_jk manages a persistent connection pool to Tomcat, thus avoiding opening
and closing connections to Tomcat for each request. The other reason is that
mod_jk uses a custom protocol named AJP and thereby avoids assembling and
disassembling header parameters that were already processed on the Web server.
You can find more details about the AJP protocol on the Jakarta Tomcat
connectors site.

For those reasons you should use mod_proxy only for low-load sites
or for testing purposes. From now on I’ll focus on mod_jk for fronting
Tomcat with Apache, because it offers better performance and scalability.

One of the major design parameters when fronting Tomcat with Apache
or any other Web server is to synchronize the maximum number of concurrent
connections. Developers often leave the default configuration values in both
Apache and Tomcat, and are then faced with spurious error messages in their
log files. The reason for that is very simple. Tomcat and Apache can each
accept only a predefined number of connections. If those
two configuration parameters differ, usually with Tomcat having the
lower configured number of connections, you will be faced with
sporadic connection errors. If the load gets even higher, your users will
start receiving HTTP 500 server errors even if your hardware is capable
of dealing with the load.

With the Apache web server, the parameter that determines the maximum number
of connections to Tomcat depends on the MPM used.

MPM        Configuration parameter
Prefork    MaxClients
Worker     MaxClients
WinNT      ThreadsPerChild
Netware    MaxThreads

On the Tomcat side the configuration parameter that limits the number
of allowed concurrent requests is maxProcessors with default value of
20. This number needs to be equal to the MPM configuration parameter.

Load balancing

Load balancing is one of the ways to increase the number of concurrent
client connections to the application server. There are two types of
load balancers that you can use. The first one is hardware load balancer
and the second one is software load balancer. If you are using load balancing
hardware, instead of a mod_jk or proxy, it must support a compatible passive
or active cookie persistence mechanism, and SSL persistence.

Mod_jk has an integrated virtual load balancer worker that can contain
any number of physical workers or particular physical nodes.
Each of the nodes can have its own balance factor, also called the worker’s
quota or lbfactor. Lbfactor is how much we expect this worker
to work, or the worker’s work quota.
This parameter usually depends on the hardware topology itself, and
it makes it possible to create a cluster with different hardware node
configurations. Each lbfactor is compared to all other lbfactors in the cluster
and the ratio gives the actual load. If the lbfactors are equal the workers’
load will be equal as well (e.g. 1-1, 2-2, 50-50, etc.). If the first
node has lbfactor 2 while the second has lbfactor 1, then the first node
will receive twice as many requests as the second one.
This asymmetric load configuration makes it possible to have nodes with
different hardware architecture.
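
How lbfactor values translate into shares of the load can be sketched in a few lines of Python. The helper below is illustrative arithmetic only and is not part of mod_jk.

```python
def load_shares(lbfactors):
    """Fraction of requests each worker is expected to receive."""
    total = sum(lbfactors.values())
    return {name: factor / total for name, factor in lbfactors.items()}


two_nodes = load_shares({"node1": 2, "node2": 1})
three_nodes = load_shares({"node1": 1, "node2": 2, "node3": 1})
```

With lbfactors 1-2-1 the second node receives half of the requests and the other two a quarter each.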

In the simplest load balancer topology with only two nodes in the
cluster, the number of concurrent connections on the web server side
can be twice as high as on a particular node. But …

    1 + 1 != 2

The statement above means that the sum of the connections allowed on the
individual nodes does not give the total number of connections allowed.
Each node has to allow a slightly higher number of
connections than its share of the desired total. This number is usually
20% higher, which means that

    1 * 1.2 + 1 * 1.2 == 2

So if you wish to have 100 concurrent connections with two nodes,
each of the nodes will have to handle a maximum of 60 connections.
The 20% margin factor is experimental, and depends on the Apache
server used. For the prefork MPM it can rise up to 50%, while for
NT or Netware its value is 0%. The reason is that
each child process manages its own balance statistics,
thus giving this 20% error for web servers with multiple child processes.

    worker.node1.type=ajp13
    worker.node1.host=10.0.0.10
    worker.node1.lbfactor=1

    worker.node2.type=ajp13
    worker.node2.host=10.0.0.11
    worker.node2.lbfactor=2

    worker.node3.type=ajp13
    worker.node3.host=10.0.0.12
    worker.node3.lbfactor=1

    worker.list=lbworker
    worker.lbworker.type=lb
    worker.lbworker.balance_workers=node1,node2,node3

The minimum configuration for a three node cluster shown in the
example above will give a 25%-50%-25% distribution of the load,
meaning that node2 will get as much load as the other two members combined.
It will also impose the following maxProcessors values for each
node in the case of MaxClients=200.

    node1 :
        <Connector ... maxProcessors="60" ... />
    node2 :
        <Connector ... maxProcessors="120" ... />
    node3 :
        <Connector ... maxProcessors="60" ... />

Using simple math the load should be 50-100-50 but we needed to add the
20% load distribution error. In case this 20% additional load is not sufficient,
you will need to set the higher value up to the 50%. Of course the average
number of connections for each particular node will still follow the
load balancer distribution quota.
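
The maxProcessors values above can be reproduced with a short calculation. This Python sketch uses hypothetical helper names and integer arithmetic so that the results are exact; it only restates the rule “share of MaxClients plus the 20% margin”.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def max_processors(lbfactors, max_clients, margin_percent=20):
    """Each node's share of max_clients plus the load distribution margin."""
    total = sum(lbfactors.values())
    return {
        name: ceil_div(max_clients * factor * (100 + margin_percent), total * 100)
        for name, factor in lbfactors.items()
    }


sizes = max_processors({"node1": 1, "node2": 2, "node3": 1}, max_clients=200)
```

Passing `margin_percent=50` gives the values for the case where the 20% margin turns out not to be sufficient.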

Sticky sessions and failover

One of the major problems with having multiple backend
application servers is determining the client-server relationship.
Once the client makes a request to a server application that
needs to track user actions over a designated time period,
some sort of state has to be enforced inside the stateless http
protocol. Tomcat issues a session identifier that
uniquely distinguishes each user. The problem with that session
identifier is that it does not carry any information about the
particular Tomcat instance that issued it.

Tomcat in that case adds an extra configurable jvmRoute
mark to that session. The jvmRoute can be any name that
uniquely identifies the particular Tomcat instance in the cluster.
On the other side of the wire, mod_jk will use that jvmRoute
as the name of the worker in its load balancer list. This means
that the name of the worker and the jvmRoute must be equal.

jvmRoute is appended to the session identifier :

http://host/app;jsessionid=0123456789ABCDEF0123456789ABCDEF.jvmRouteName
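
To make the routing idea concrete, the Python sketch below extracts the route suffix from a session identifier and maps it to a worker. It is an illustration of the naming convention only; the function names and the worker dictionary are assumptions and do not reflect mod_jk’s internal code.

```python
def route_from_session_id(session_id):
    """Return the jvmRoute suffix of a session id, or None if there is none."""
    _, dot, route = session_id.rpartition(".")
    return route if dot and route else None


def pick_worker(session_id, workers):
    """Return the sticky worker for this session, or None to let the balancer choose."""
    route = route_from_session_id(session_id)
    return workers.get(route) if route else None


workers = {"node1": "10.0.0.10", "node2": "10.0.0.11"}
sticky = pick_worker("0123456789ABCDEF0123456789ABCDEF.node2", workers)
fresh = pick_worker("0123456789ABCDEF0123456789ABCDEF", workers)
```

This is why the worker name and the jvmRoute have to match: the suffix is the only link between the session and the node that created it.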

When you have multiple nodes in a cluster you can improve your application
availability by implementing failover. Failover means that if the
particular elected node cannot fulfill the request, another node
will be selected automatically. In the case of three nodes you are actually
doubling your application availability. The application response
time will be slower during failover, but none
of your users will be rejected. Inside the mod_jk configuration there
is a special configuration parameter called worker.retries that has a default
value of 3, but that needs to be adjusted to the actual number of nodes in the
cluster.

    ...
    worker.list=lbworker
    worker.lbworker.type=lb
    # Adjust to the number of workers
    worker.retries=4
    worker.lbworker.balance_workers=node1,node2,node3,node4

If you add more than three workers to the load balancer,
adjust the retries parameter to reflect that number.
This ensures that even in the worst case scenario the request
gets served as long as there is a single operable node. Of course, the
request will be rejected if there are no free connections available on the
Tomcat side, so you should increase the allowed number of connections
on each Tomcat instance. In the three node scenario (1-2-1),
if one of the nodes goes down, the other
two will have to take its load. To cover the failure of any single node you
will need to set the following Tomcat configuration:

    node1 :
        <Connector ... maxProcessors="120" ... />
    node2 :
        <Connector ... maxProcessors="160" ... />
    node3 :
        <Connector ... maxProcessors="120" ... />

This configuration will ensure that 200 concurrent connections will
always be allowable no matter which of the nodes goes down. The reason for
doubling the number of processors on node1 and node3 is because they
need to handle the additional load in case node2 goes down (load 1-1).
Node2 also needs the adjustment because
if one of the other two nodes goes down, the load will be 1-2. As you
can see the 20% load error is always calculated in.

Figure 4. Three node example load balancer
Figure 5. Failover for node2

As shown in the two figures above setting maxProcessors depends both
on 20% load balancer error and expected single node failure. The
calculation must include the node with the highest lbfactor as
the worst case scenario.
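
The failover sizing above can be checked with a short calculation. In this Python sketch (hypothetical helper names, integer arithmetic), each node is sized for the worst single failure among the other nodes, with the 20% margin included.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def failover_max_processors(lbfactors, max_clients, margin_percent=20):
    """Size each node for the worst single-node failure among its peers."""
    sizes = {}
    for name, factor in lbfactors.items():
        worst = 0
        for failed in lbfactors:
            if failed == name:
                continue
            # Load is redistributed over the lbfactors of the surviving nodes.
            remaining = sum(f for n, f in lbfactors.items() if n != failed)
            needed = ceil_div(
                max_clients * factor * (100 + margin_percent), remaining * 100
            )
            worst = max(worst, needed)
        sizes[name] = worst
    return sizes


sizes = failover_max_processors({"node1": 1, "node2": 2, "node3": 1}, 200)
```

For node1 and node3 the worst case is losing node2 (load 1-1), while for node2 it is losing either of the others (load 1-2), which gives the 120-160-120 values shown earlier.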

Domain Clustering model

Since JK version 1.2.8 there is a new domain clustering model that
improves the horizontal scalability and performance of a Tomcat cluster.

A Tomcat cluster only allows session replication to all nodes in the cluster.
Once you work with more than 3-4 nodes there is too much overhead and risk in
replicating sessions to all nodes, so we split the nodes into clustered groups.
The newly introduced worker attribute domain lets
mod_jk know to which other nodes a session gets replicated (all workers with
the same value in the domain attribute). So a load balancing worker knows on
which nodes the session is alive. If a node fails or is being taken down
administratively, mod_jk chooses another node that has a replica of the session.

For example, if you have a cluster with four nodes you can make
two virtual domains and replicate the sessions only inside the domains.
This will lower the replication network traffic by half.

Figure 6. Domain model clustering

For the above example the configuration would look like:

    worker.node1.type=ajp13
    worker.node1.host=10.0.0.10
    worker.node1.lbfactor=1
    worker.node1.domain=A

    worker.node2.type=ajp13
    worker.node2.host=10.0.0.11
    worker.node2.lbfactor=1
    worker.node2.domain=A

    worker.node3.type=ajp13
    worker.node3.host=10.0.0.12
    worker.node3.lbfactor=1
    worker.node3.domain=B

    worker.node4.type=ajp13
    worker.node4.host=10.0.0.13
    worker.node4.lbfactor=1
    worker.node4.domain=B

    worker.list=lbworker
    worker.lbworker.type=lb
    worker.lbworker.balance_workers=node1,node2,node3,node4

Now assume you have multiple Apaches and Tomcats. The Tomcats are clustered and
mod_jk uses sticky sessions. Now you are going to shut down one Tomcat for
maintenance. All Apaches will start connections to all Tomcats. You end up with
all Tomcats getting connections from all Apache processes, so the number of
threads needed inside the Tomcats will explode.
If you group the Tomcats into domains as explained above, the connections will
normally stay inside the domain and you will need far fewer threads.


Fronting Tomcat with IIS

Just like the Apache Web server for Windows, Microsoft IIS maintains
a separate child process and thread pool for serving concurrent client
connections. For non-server products like Windows 2000 Professional or
Windows XP the number of concurrent connections is limited to 10.
This means that you cannot use workstation products for production
servers unless the 10-connection limit fulfils your needs.
The server range of products does not impose that 10-connection
limit but, just as with Apache, 2000 connections is the limit at which
thread context switching takes its share and reduces the
effective number of concurrent connections.
If you need higher load you will need to deploy additional web servers
and use Windows Network Load Balancer (WNLB) in front of Tomcat servers.

Figure 7. WNLB High load configuration

For topologies using Windows Network Load Balancer the same rules are in place
as for the Apache with worker MPM. This means that each Tomcat instance
will have to handle 20% higher connection load per node than its real lbfactor.
The workers.properties configuration must be
identical on each node that constitutes WNLB, meaning that you will have to
configure all four Tomcat nodes.


Apache 2.2 and new mod_proxy

For the new Apache 2.1/2.2, mod_proxy has been rewritten and has
a new AJP capable protocol module (mod_proxy_ajp) and an integrated
software load balancer (mod_proxy_balancer).

Because it can maintain a constant connection pool to the backend
servers it can replace the mod_jk functionality.

    LoadModule proxy_module modules/mod_proxy.so
    LoadModule proxy_ajp_module modules/mod_proxy_ajp.so
    LoadModule proxy_balancer_module modules/mod_proxy_balancer.so
    ...
    <Proxy balancer://mycluster>
        BalancerMember ajp://10.0.0.10:8009 min=10 max=100 route=node1 loadfactor=1
        BalancerMember ajp://10.0.0.11:8009 min=20 max=200 route=node2 loadfactor=2
    </Proxy>
    ProxyPass /servlets-examples balancer://mycluster/servlets-examples

The above example shows how easy it is to configure a Tomcat cluster with
the proxy load balancer. Among the major advantages of using the proxy are the
integrated caching and the fact that there is no need to compile an external
module.

Mod_proxy_balancer has an integrated manager for dynamic parameter changes.
It allows changing session routes or disabling a node for maintenance.

    <Location /balancer-manager>
        SetHandler balancer-manager
        Order deny,allow
        Allow from localhost
    </Location>

Figure 8. Changing BalancerMember parameters

Future development of mod_proxy will include the option to
dynamically discover the particular node topology. It will also make it
possible to dynamically update loadfactors and session routes.


About the Author

Mladen Turk is a Developer and Consultant for JBoss Inc in Europe, where he is
responsible for native integration. He is a long time committer for the Jakarta
Tomcat Connectors, Apache Httpd and Apache Portable Runtime projects.


Links and Resources

Jakarta Tomcat connectors documentation

Apache 2.0 documentation

Apache 2.1 documentation
