Creating Powerful Tools for Data Scientists
I served as the primary contributor to the R package stream, a framework for data stream modeling and associated data mining tasks such as clustering and classification.
Part of my work for stream was published in Studies in Classification, Data Analysis, and Knowledge Organization. Stream was also the focus of my senior thesis.
On the right is an example of the package in action, taken from its documentation.
stream <- DSD_Benchmark(1)
animate_data(stream, horizon=100, n=5000, xlim=c(0,1), ylim=c(0,1))

### animations can be replayed with the animation package
library(animation)
animation::ani.options(interval=.1)  ## change speed
ani.replay()

### animations can also be saved as HTML, animated gifs, etc.
saveHTML(ani.replay())

### animate the clustering process with evaluation
### Note: we choose to ignore noise points even if the algorithm would assign
### them to a cluster
reset_stream(stream)
dbstream <- DSC_DBSTREAM(r=.04, lambda=.1, gaptime=100, Cm=3,
  shared_density=TRUE, alpha=.2)
animate_cluster(dbstream, stream, horizon=100, n=5000,
  measure="crand", type="macro", assign="micro", noise = "ignore",
  plot.args = list(xlim=c(0,1), ylim=c(0,1), shared = TRUE))
The power of stream…
Below are some of the features I developed for the package.
Evaluation
Stream allows users to run several data-stream clustering algorithms on the same dataset and compare their evaluation metrics to see which algorithm performs best.
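A minimal sketch of such a comparison follows. It rests on several assumptions that are not taken from the example above: the algorithm parameters are illustrative, the stream is a synthetic one from DSD_Gaussians, and the name of the evaluation function appears to differ between releases (evaluate() in older ones, evaluate_static() in newer ones, as far as I can tell), so the code picks whichever the installed version exports.

```r
library(stream)

set.seed(42)

# Store a fixed set of points so that every algorithm sees the same data.
# Assumption: 3 Gaussian clusters in 2 dimensions with 5% noise.
replay <- DSD_Memory(DSD_Gaussians(k = 3, d = 2, noise = 0.05), n = 1500)

# Candidate algorithms; the parameter values are illustrative, not tuned.
algorithms <- list(
  DBSTREAM = DSC_DBSTREAM(r = 0.05),
  DStream  = DSC_DStream(gridsize = 0.05)
)

# Assumption: newer releases export evaluate_static(), older ones evaluate().
exports <- getNamespaceExports("stream")
evaluate_fn <- if ("evaluate_static" %in% exports) {
  stream::evaluate_static
} else {
  stream::evaluate
}

results <- lapply(algorithms, function(dsc) {
  reset_stream(replay)              # start again from the first stored point
  update(dsc, replay, n = 1000)     # cluster the first 1000 points
  # score the clustering on the remaining 500 points
  evaluate_fn(dsc, replay, measure = c("purity", "crand"), n = 500)
})

print(results)
```

Replaying a stored stream through DSD_Memory is what keeps the comparison fair: each algorithm is trained and scored on exactly the same points.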
Animation
To help data scientists see how a data stream and its clusters evolve over time, stream can export animations like the one at the top of the page.
Artificial Streams
It is difficult to find real datasets that clearly exhibit behaviors such as merging and splitting of clusters, so stream provides a rich framework for generating artificial data.
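As a rough illustration, the sketch below draws two windows of points from the DSD_Benchmark(1) stream used in the documentation example and plots them side by side. The window sizes and the number of skipped points are arbitrary choices, and how much the clusters have moved between the two windows depends on the benchmark definition in the installed version.

```r
library(stream)

set.seed(1)

# Built-in artificial benchmark stream (the same one as in the documentation example)
stream <- DSD_Benchmark(1)

# Take an early window, skip ahead, then take a later window.
early <- get_points(stream, n = 500)
invisible(get_points(stream, n = 2000))   # discard points to move forward in time
later <- get_points(stream, n = 500)

# Plot the two coordinate columns of each window side by side.
op <- par(mfrow = c(1, 2))
plot(early[, 1:2], xlim = c(0, 1), ylim = c(0, 1), main = "Early window")
plot(later[, 1:2], xlim = c(0, 1), ylim = c(0, 1), main = "Later window")
par(op)

# Rewind the stream so it can be reused, for example by a clustering algorithm.
reset_stream(stream)
```

Because the generator is fully specified, the same evolving behavior can be reproduced on demand, which is hard to guarantee with real data.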