Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

Professional Services

Second Signal

Presentations

Andrew's Blog

Support

Data Dashboard challenge in Notes - Better sometimes, than using an RDB

By Andrew Pollack on 07/19/2008 at 10:50 AM EDT

This challenge is about dealing with a significant volume of raw data, and reporting on in a useful way. The fundamental difference in working with data in Notes versus working with it in a relational database, is that while typically an RDB can handle more raw data, viewing that raw data through reports only gives you exactly what you ask for. Notes on the other hand presents data in "Views" which are much more organic -- particularly if you've designed them well. You don't just see what you're asking for, you see the context in which that data exists. To do that in an RDB you start having to look at very expensive data mining tools.

The Challenge

It's time to set up a monitoring dashboard for Second Signal. The system has gotten busy enough that I want to start monitoring the total call volume at any given time across departments, the average calls per department, and the range of call duration across users.

This tool will allow me to monitor telephony use by department, by user, and by concurrency -- that is to say, at any moment, how many channels are open to the system from the telephone world at the same time. As Second Signal scales, the server costs on a per department basis really don't increase very much. There is a terrific economy of scale there. The telephony side, however, does tend to increase in line with the number of users up to a certain point. Once that tipping point is reached however, I can move to a different model of purchasing inbound telephony that is much more effectively pooled. This tool will help me move toward that point.

Asterisk, the telephony switch I'm using, produces excellent logging data as you would expect. One option I have is to simply parse that logging data to get the statistics I need. There are also API calls that can query the switch in real time to see things like call volume and status. I want the data a little differently however, so I'm doing some logging of my own as part of the call scripting. That's not really an issue. The issue is storing that raw data in a way that is most useful for producing the reports and statistics that I want.

Deciding between Notes and an RDB

Relational Databases are generally better had handling large volumes of structured data, while Notes is generally better at storing "Information" -- which is data that has already been processed into something meaningful to people. There is a myth that a database is going to be faster for storing data. That's only partly true. Storing the data isn't hard at all. Retrieving it in a meaningful way is where the performance gets more interesting. With an RDB you have to construct reports fed by queries. Those queries take longer and are harder on the server based on their complexity. In Notes, the query and report work is built into the overall performance because most people consider it all in the same scope. Its all in the "Notes Database" so its all one thing.

Where it makes sense to store data in Notes is when you can know in advance the kind of reporting and visualization you're going to want out of the system and you can pre-build that into the storage mechanism. If I just stored each phone call as a unique entity in Notes, it would end up being a very large volume of data that wasn't well suited to giving me the kinds of reports I want. In that case, an RDB would be the better choice. If I can find a way to store the data in Notes in a way that is meaningful to people and can drive views easily, then Notes will be the more effective storage choice.

Determine what you want to store

The first thing you want to do if you're going to be storing data in Notes, is to think about what aspect of that data contains the core value yet will be the hardest to bring to together in a single view entry. In my case, it isn't the specifics of a single phone call. It turns out that the single phone call record actually holds almost no value to me.

What matters to me is concurrency. In other words, what was going on at a given moment. I want to be able to see how busy the system was at any given moment and what the trend of activity is over time and over certain periods of time. By coming to that conclusion, I've just identified the data records I want to store. I want to store "Concurrency Records". I want one record which will show me the concurrency for a specific moment.

Determine the granularity to store

With concurrency, you have a tricky problem. How long a period are you watching for each measure of concurrent activity? If you stored individual call records with a start and end time, good reporting tools could give you a curve rather than a numeric chart but that's not necessary in this case. Again, it goes back to what matters to my particular use case. That's billing. Telephony billing is generally done is six second increments -- one tenth of a minute each. That means if I want meaningful data, I need it to be stored at the level of "Concurrent Activity per Six Second Interval". This will give me a view on data which tells me how busy the system was during each period for which I have to shell out money.

Fit the data storage to the platform

The most obvious way to store "Activity per Six Second Interval" into Notes would be to use a single document for each interval. Unfortunately, that would mean 5,256,000 records per year or nearly half a million documents per month. That's not a very good model for Notes, though its feasible in newer versions. Reporting on that kind of Notes database would be cumbersome. The unstructured nature of Notes really helps with that. First of all, I don't need to store any records for time periods containing no activity. Second, I don't have to store a single Note for each interval. I considered storing one document per hour but that isn't granular enough. I end up with very few documents ( 8760 per year ) but too many fields on each document.

What I decided to do is to store one document per minute with one field on the document for each period within that minute. In other words, 10 fields. This balances the overall document count (about half a million per year) with the ease of reporting I need. If each field stores the concurrency data (i.e. 5 active calls) for each 1/10 of a minute, reporting is easy and document volume isn't excessive.

Make the most out of the data you store

Before I got started writing, however, there was one more thing I wanted to consider. I need to know not just how many calls were happening during each interval, but I wanted to know who was making them. I want to be able to report on which department was using the system and also within the departments which telephone numbers were calling in. In a relational database, I would do this with a complex "Join" in the query. In Notes, I'm taking advantage of the ability to pre-build that data and keep it available. To do that, I decided to create three fields for each interval. One field would be a list of the active calls each department had open at the time. The second field would be a list of each callerid that was active during that interval. The third would be a simple count of the total number of active calls at that time. The third would is essentially just an "@Elements()" function of either of the other two fields. That could be done just as easily in a view column, however by storing it on the record I avoid that calculation for every document in every view every time the view is rebuilt. Since the data won't change, it is much better to pre-build it and store it on the document.

The resulting document fields

Interval Form One Document Per Minute

Interval Concurrency Departments Callers

EXAMPLE 3 DEPT0001
DEPT0002
DEPT0003 555-555-1001
555-555-1002
555-555-1003

Document Fields

0-5 Seconds Block_0_Total Concurrency Block_0_Departments Block_0_CallerIds

6-11 Seconds Block_1_Total Concurrency Block_1_Departments Block_1_CallerIds

12-17 Seconds Block_2_Total Concurrency Block_2_Departments Block_2_CallerIds

18-23 Seconds Block_3_Total Concurrency Block_3_Departments Block_3_CallerIds

24-29 Seconds Block_4_Total Concurrency Block_4_Departments Block_4_CallerIds

30-35 Seconds Block_5_Total Concurrency Block_5_Departments Block_5_CallerIds

36-41 Seconds Block_6_Total Concurrency Block_6_Departments Block_6_CallerIds

42-47 Seconds Block_7_Total Concurrency Block_7_Departments Block_7_CallerIds

48-53 Seconds Block_8_Total Concurrency Block_8_Departments Block_8_CallerIds

54-59 Seconds Block_9_Total Concurrency Block_9_Departments Block_9_CallerIds

The documents will also have a few other fields to make views easier to create. For example, they will have a "TimePeriod" field which will be the date and time which is covered by the document. They'll also have that time broken down in the "Year, Month, Day, Hour, and Minute" fields which are numeric, and also MonthName and DayName text fields.

That's seven fields which could easily be calculated in view columns or selection formulas. By storing them on the document, it amounts to hundreds of thousands of formula iterations saved on each view re-index or database scan (in the case of view selection formulas).

The result of all this analysis and planning is that I'll have a data record set which exactly matches my reporting needs and is very easy for the system to keep indexed. All of this is possible without the cost and complexity of using an external relational database.

Getting the most out of a platform comes from really understanding three things. The platform, The Data, and the Desired Result. Without a full and careful examination of all three, you won't get as good a result.

There are - loading - comments....

Subject
Your Name
Homepage
*Your Email
	* Your email address is required, but not displayed.

Your thoughts....


	Remember Me

Andrew Pollack's Blog

Data Dashboard challenge in Notes - Better sometimes, than using an RDB

Site Links

Useful Links

Other Recent Stories...

Interval Form	One Document Per Minute

Interval	Concurrency	Departments	Callers

EXAMPLE	3	DEPT0001 DEPT0002 DEPT0003	555-555-1001 555-555-1002 555-555-1003
	Document Fields

0-5 Seconds	Block_0_Total Concurrency	Block_0_Departments	Block_0_CallerIds
6-11 Seconds	Block_1_Total Concurrency	Block_1_Departments	Block_1_CallerIds
12-17 Seconds	Block_2_Total Concurrency	Block_2_Departments	Block_2_CallerIds
18-23 Seconds	Block_3_Total Concurrency	Block_3_Departments	Block_3_CallerIds
24-29 Seconds	Block_4_Total Concurrency	Block_4_Departments	Block_4_CallerIds
30-35 Seconds	Block_5_Total Concurrency	Block_5_Departments	Block_5_CallerIds
36-41 Seconds	Block_6_Total Concurrency	Block_6_Departments	Block_6_CallerIds
42-47 Seconds	Block_7_Total Concurrency	Block_7_Departments	Block_7_CallerIds
48-53 Seconds	Block_8_Total Concurrency	Block_8_Departments	Block_8_CallerIds
54-59 Seconds	Block_9_Total Concurrency	Block_9_Departments	Block_9_CallerIds