[Bismark-devel] [Hndr] Dates are really tricky!!

Nick Feamster feamster at cc.gatech.edu
Mon Apr 18 10:11:10 EDT 2011


Dave,

Django does allow native SQL, but the cool thing about it is that most queries can be done with Python methods that essentially do introspection on the database tables.  It's pretty slick.  Switching from mysql to postgres (when we are ready) is a one-line change.

I like the idea of generating summary tables per day.  That would _really_ speed things up, I think, and most people are probably just looking at the most recent data by default.  Is this in the bug database yet?  I'll be happy to file it. :-)  Who is working on this part?
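
If the summary table ends up looking like the device_port_daily_summary sketch further down, the default "most recent day" view could be as simple as the following (untested on my end; the 2011-02-20 date and the NB105 device are just placeholders):

select dstport, up_p, down_p, up_bytes, down_bytes from device_port_daily_summary where deviceid='NB105' and test_year=2011 and test_day=dayofyear('2011-02-20') order by down_bytes desc limit 20;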

-Nick


On Apr 18, 2011, at 8:46 AM, Dave Taht wrote:

> On 04/18/2011 06:18 AM, Dave Taht wrote:
>> On 04/17/2011 11:57 PM, Nick Feamster wrote:
>> 
>>> I think we can whip up some quick visualizations of what's in the DB with a nice Web page with some concerted python hacking.  For starters, it would be nice to have a version of the data whereby queries can actually complete in a reasonable time.  :-)  What's the best way to do this?  Split the table by months?
>>> 
>> I did look over one of the slower (30+ seconds!) queries and suggested using a BETWEEN clause for the date range instead of two ANDed comparisons, but we never got around to testing it, and MySQL historically hasn't been able to do anything sane with a BETWEEN range (Postgres can).
>> 
>> e.g.:
>> 
>> mysql>  select dstport, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat where tsstart>unix_timestamp('2011-2-6') and tsstart<unix_timestamp('2011-2-20') and deviceid='NB105' group by dstport order by down_bytes desc limit 20;
>> 
>> MIGHT do better with:
>> 
>> select dstport, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat where tsstart between unix_timestamp('2011-2-6') and unix_timestamp('2011-2-20') and deviceid='NB105' group by dstport order by down_bytes desc limit 20;
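>> 
>> A quick way to actually test that (which we still haven't done) might be to run both forms under EXPLAIN and compare the key and rows columns, e.g.:
>> 
>> explain select count(*) from FLOWS_newformat where tsstart between unix_timestamp('2011-2-6') and unix_timestamp('2011-2-20') and deviceid='NB105';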
>> 
>> 
>> I'd not be huge on divvying up tables by month, but creating summary tables by day and by hour would be a good idea. Usage patterns tend to be self-similar across day of week and by hour of day.
>> 
>> I note that it is important to be able to capture hour of day AT the location (EDT/MDT/PDT), as that is more self-similar than hour of day in UTC.
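>> 
>> For the hourly roll-up, something along these lines might work (an untested sketch: device_port_hourly_summary is a made-up name, it assumes the server session runs in UTC and that the MySQL time zone tables are loaded so convert_tz() works, and it hard-codes 'America/New_York' where we would really want each device's own zone):
>> 
>> create table device_port_hourly_summary as select deviceid, dstport, hour(convert_tz(from_unixtime(tsstart), 'UTC', 'America/New_York')) as local_hour, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat group by deviceid, dstport, local_hour;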
>> 
> To be clearer: what I suggest above is not solved by what I wrote below, which also contained an error.
> 
> Coping with dates can be really tricky and should not be undertaken before my first cup of coffee.
> 
> http://www.postgresql.org/docs/8.2/static/datatype-datetime.html
> 
>> So the above becomes a nightly cron job that populates a specialized summary table, something like (untested):
>> 
>> drop table device_port_daily_summary;
>> create table device_port_daily_summary as
>> 
> select deviceid, dstport, dayofyear(from_unixtime(tsstart)) as test_day, year(from_unixtime(tsstart)) as test_year, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat group by deviceid, dstport, test_day, test_year;
> 
> The version quoted below only keeps a day-of-year value (roughly 0-365), which wasn't what you wanted: the same day of the year in different years gets lumped together. Days since the epoch or since the start of database time would probably be better than explicitly storing the year... but I don't remember the good way of doing that (one option is sketched below).
> 
> Dates are tricky.
> 
>> select deviceid, dstport, dayofyear(from_unixtime(tsstart)) as test_day, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat group by deviceid, dstport, test_day
> 
> would be wrong.
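> 
> For "days since the epoch", one simple option (untested, and these are UTC days, so the local-hour caveat above still applies) is plain integer division of the unix timestamp:
> 
> select deviceid, dstport, floor(tsstart/86400) as epoch_day, sum(uppkts) as up_p, sum(dwpkts) as down_p, sum(upbytes) as up_bytes, sum(dwbytes) as down_bytes from FLOWS_newformat group by deviceid, dstport, epoch_day;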
> 
> _______________________________________________
> HNDR mailing list
> HNDR at lists.pixilab.org
> http://lists.pixilab.org/listinfo.cgi/hndr-pixilab.org



