The industry now has a buzzword, “BIG DATA,” for how we’re going to do something with the huge amount of information piling up. “Big data” is replacing “Business Intelligence,” which subsumed “Reporting,” which put a nicer gloss on “Spreadsheets,” which beat out the old-fashioned “Printouts.” Managers who long ago studied printouts are now hiring mathematicians who claim to be big data specialists to help them solve the same old problem: what’s selling, why, and how do we sell more?
It’s not fair to suggest that these buzzwords are simple replacements for each other. Big data is a more complicated world because the scale is much larger…
The information is usually spread out over a number of servers, and the work of compiling the data must be coordinated among them. In the past, the work was largely delegated to the database software, which would use its magical JOIN mechanism (I’ll randomly insert SQL queries here and there, so just bear with me. Maybe I’ll do a post on that.) to compile tables, then add up the columns before handing off the rectangle of data to the reporting software that would paginate it.
This was often harder than it sounds. Database programmers can tell you stories about complicated JOIN commands that would lock up their database for hours as it tried to produce a report for the boss who wanted his columns just so… *sigh
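To make the JOIN-then-add-up-the-columns step concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, and values are made up for illustration; a real reporting database would be far larger and messier.

```python
import sqlite3

# Hypothetical tables and values, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (product_id INTEGER, amount REAL);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO sales VALUES (1, 10.0), (1, 15.0), (2, 7.5);
""")

# JOIN the two tables, then add up the amount column per product.
report = conn.execute("""
    SELECT p.name, SUM(s.amount) AS total
    FROM products AS p
    JOIN sales AS s ON s.product_id = p.id
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()

# Hand the "rectangle of data" to whatever formats the report.
for name, total in report:
    print(f"{name}: {total:.2f}")

conn.close()
```

On three rows this is instant. The horror stories start when the same kind of query has to chew through millions of rows across many tables.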
The game is much different now. Hadoop is a popular tool for organizing the racks and racks of servers, and NoSQL (it’s “Not Only SQL” and not “NO SQL”!) databases are popular tools for storing data on these racks.
These mechanisms can be much more powerful than the old single machine, but they are far from being as polished as the old database servers. Although SQL may be complicated, writing the JOIN query for the SQL databases was often much simpler than gathering information from dozens of machines and compiling it into one coherent answer. Hadoop jobs are written in Java, and that requires another level of sophistication. The tools for tackling big data are just beginning to package this distributed computing power in a way that’s a bit easier to use.
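The “gather from many machines, then compile one answer” pattern can be sketched in a few lines of plain Python. This is not Hadoop code and uses no Hadoop API; the three lists simply stand in for data held on three hypothetical servers, and the function names are my own.

```python
from collections import Counter
from functools import reduce

# Hypothetical sales records held by three separate servers.
shards = [
    ["widget", "gadget", "widget"],
    ["gadget", "gizmo"],
    ["widget"],
]

def map_shard(records):
    """Count items locally, as each server would do on its own data."""
    return Counter(records)

def merge_counts(left, right):
    """Combine two partial counts into one."""
    return left + right

# Each shard is counted on its own, then the partial results are merged.
partials = [map_shard(shard) for shard in shards]
totals = reduce(merge_counts, partials, Counter())

print(totals.most_common())
```

The hard part in real life is everything this sketch leaves out: shipping the work to the machines, coping with failures, and moving the partial results around.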
Many of the big data tools are also working with NoSQL data stores. These are more flexible than traditional relational databases, but the flexibility isn’t as much of a departure from the past as Hadoop. NoSQL queries can be simpler because the database design discourages the complicated tabular structure that drives the complexity of working with SQL. The main worry is that software needs to anticipate the possibility that not every row will have some data for every column.
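Here is a small sketch of that worry in Python. The documents, the field names, and the "n/a" placeholder are all assumptions for illustration; the point is only that the code has to decide what to do when a field is missing.

```python
# Hypothetical documents, shaped like what a document store might return.
customers = [
    {"name": "Ada", "email": "ada@example.com", "city": "London"},
    {"name": "Grace", "city": "New York"},
    {"name": "Linus"},
]

def to_row(doc, columns, default="n/a"):
    """Flatten one document into a fixed set of columns, filling any gaps."""
    return [doc.get(col, default) for col in columns]

columns = ["name", "email", "city"]
rows = [to_row(doc, columns) for doc in customers]

for row in rows:
    print(" | ".join(row))
```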
None of these tools will tell you which questions to ask. That has to come from within you or the other humans working on the project. Understanding the data and finding the right question to ask is often much more complicated than getting your Hadoop job to run quickly. That’s really saying something, because these tools are only half of the job.
The Jaspersoft package is one of the open source leaders for producing reports from database columns. The software is well-polished and already installed in many businesses, turning SQL tables into PDFs that everyone can scrutinize at meetings… *sigh
Once you get the data from your sources, Jaspersoft’s server will boil it down to interactive tables and graphs. The reports can be quite sophisticated interactive tools that let you drill down into various corners. You can ask for more and more details if you need them.
This is a well-developed corner of the software world, and Jaspersoft is expanding by making it easier to use these sophisticated reports with newer sources of data. Jaspersoft isn’t offering particularly new ways to look at the data, just more sophisticated ways to access data stored in new locations.
Pentaho is another software platform that began as a report-generating engine; it is, like Jaspersoft, branching into big data by making it easier to absorb information from the new sources. You can hook up Pentaho’s tool to many of the most popular NoSQL databases such as MongoDB and Cassandra. Once the databases are connected, you can drag and drop the columns into views and reports as if the information came from SQL databases.
Pentaho also provides software for drawing HDFS file data and HBase data from Hadoop clusters. One of the more intriguing tools is the graphical programming interface known as either Kettle or Pentaho Data Integration. It has a bunch of built-in modules that you can drag and drop onto a picture and then connect. Pentaho has thoroughly integrated Hadoop and the other sources into this, so you can write your code and send it out to execute on the cluster.
Many of the big data tools did not begin life as reporting tools. Karmasphere Studio, for instance, is a set of plug-ins built on top of Eclipse. It’s a specialized IDE that makes it easier to create and run Hadoop jobs.
Karmasphere also distributes a tool called Karmasphere Analyst, which is designed to simplify the process of plowing through all of the data in a Hadoop cluster. It comes with many useful building blocks for programming a good Hadoop job, like subroutines for uncompressing Zipped log files. Then it strings them together and parametrizes the Hive calls to produce a table of output for perusing.
Talend also offers an Eclipse-based IDE for stringing together data processing jobs with Hadoop. Its tools are designed to help with data integration, data quality, and data management, all with subroutines tuned to these jobs.
Talend Studio allows you to build up your jobs by dragging and dropping little icons onto a canvas. If you want to get an RSS feed, Talend’s component will fetch the RSS and add proxying if necessary. There are dozens of components for gathering information and dozens more for doing things like a “fuzzy match.” Then you can output the results.
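If “fuzzy match” sounds mysterious, the basic idea fits in a few lines of Python. This sketch uses the standard library’s difflib and is only an illustration of the concept; it is not how Talend’s component works internally, and the customer names and the 0.6 cutoff are made up.

```python
import difflib

# Hypothetical list of clean, known customer names.
known_customers = ["Acme Corporation", "Globex Inc", "Initech LLC"]

def fuzzy_match(name, candidates, cutoff=0.6):
    """Return the closest candidate to name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A misspelled entry still finds its intended record.
print(fuzzy_match("Acme Corporaton", known_customers))

# Something unrelated finds nothing.
print(fuzzy_match("Zzyzx", known_customers))
```

Raising the cutoff makes the match stricter; lowering it catches sloppier typos at the price of more false matches.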
Not all of the tools are designed to make it easier to string together code with visual mechanisms. Skytree offers a bundle that performs many of the more sophisticated machine-learning algorithms. All it takes is typing the right command into a command line.
Skytree is more focused on the guts than the shiny GUI. Skytree Server is optimized to run a number of classic machine-learning algorithms on your data using an implementation the company claims can be 10,000 times faster than other packages. It can search through your data looking for clusters of mathematically similar items, then invert this to identify outliers that may be problems, opportunities, or both. The algorithms can be more precise than humans, and they can search through vast quantities of data looking for the entries that are a bit out of the ordinary. This may be fraud — or a particularly good customer who will spend and spend.
The free version of the software offers the same algorithms as the proprietary version, but it’s limited to data sets of 100,000 rows. This should be sufficient to establish whether the software is a good match.
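To give a feel for what hunting for “entries that are a bit out of the ordinary” means, here is a tiny Python sketch. It is not Skytree’s algorithm. Instead of clustering, it swaps in a much simpler technique: flagging values that sit far from the median, measured in units of the median absolute deviation. The customer IDs, spend figures, and the 3.5 threshold are assumptions for illustration.

```python
import statistics

# Hypothetical daily spend per customer.
spend = {"c1": 20.0, "c2": 22.0, "c3": 19.0, "c4": 21.0, "c5": 250.0, "c6": 18.0}

def find_outliers(values, threshold=3.5):
    """Return keys whose value is unusually far from the median."""
    med = statistics.median(values.values())
    # Median absolute deviation: the typical distance from the median.
    mad = statistics.median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # Too many identical values to judge what is unusual.
    # 0.6745 rescales the score so it is comparable to a standard z-score.
    return [key for key, v in values.items()
            if 0.6745 * abs(v - med) / mad > threshold]

outliers = find_outliers(spend)
print(outliers)
```

Whether the flagged customer is a fraudster or your best buyer is still a question for a human.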
Tableau Desktop is a visualization tool that makes it easy to look at your data in new ways, then slice it up and look at it in a different way. You can even mix the data with other data and examine it in yet another light. The tool is optimized to give you all the columns for the data and let you mix them before stuffing them into one of the dozens of graphical templates provided.
Splunk is a bit different from the other options. It’s not exactly a report-generating tool or a collection of AI routines, although it accomplishes much of that along the way. It creates an index of your data as if your data were a book or a block of text. Yes, databases also build indices, but Splunk’s approach is much closer to a text search process.
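The “index your data like a book” idea can be sketched as a small inverted index in Python. This is a toy illustration of text indexing in general, not Splunk’s actual implementation; the log lines and function names are invented.

```python
from collections import defaultdict

# Hypothetical log lines.
log_lines = [
    "ERROR disk full on server01",
    "INFO backup complete on server02",
    "ERROR timeout contacting server02",
]

def build_index(lines):
    """Map each lowercase word to the set of line numbers where it appears."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for word in line.lower().split():
            index[word].add(line_no)
    return index

def search(index, *words):
    """Return the sorted line numbers that contain every given word."""
    sets = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*sets)) if sets else []

index = build_index(log_lines)

# Print every line that mentions both an error and server02.
for line_no in search(index, "error", "server02"):
    print(log_lines[line_no])
```

Once the index exists, a search touches only the entries for the words you asked about instead of rescanning every line.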
After wading through these products, it became clear that “big data” was much bigger than any single buzzword. It’s not really fair to lump together products that largely build tables with those that attempt complicated mathematical operations. Nor is it fair to compare simpler tools that work with generic databases with those that attempt to manage larger stacks spread out over multiple machines in frameworks like Hadoop.