When investment billionaire Warren Buffett was young, he kept a ledger of container deliveries at the local docks, which he staked out himself. He knew that changes in shipping traffic corresponded to changes in demand for a commodity. This knowledge earned him wealth, but it also represented an early foray into market analytics – the process of analysing data to detect predictive patterns, or “signatures”.

 

Recent years have seen burgeoning growth in data production. The financial sector, social media, and even the “Internet of Things” are generating more information every year. This explosion of information has coincided with falling storage and computing costs, yielding a broad new ecosystem of big data analytics platforms.

Getting started with R

First, install R.

For a nice interface, install RStudio.

Awesome tutorials for beginners

Introduction – RTutor.com

Beginner's Guide to R – Computerworld

Introduction – Data Camp

 

Big data analytics is often thought of as a proprietary service that companies use to improve performance or investments. But it doesn’t have to be this way. Much like Buffett’s shipping ledger, the winners in big data analytics will be those who find the right data and interpret it the right way.

Expensive, proprietary software isn’t a prerequisite – in fact, it can even be a disadvantage. Why? Cutting-edge research groups often release open source tools for pattern recognition and machine learning well before they are popularised or integrated into commercial analytics software. It’s now easier than ever for individuals or small businesses to get started with big data analytics. Python and R are both free, open source platforms that are ideal for setting up a basic data analytics pipeline, and they’re exceptionally well supported. But while they overlap in function, Python and R tend to target different applications:

Python excels in data manipulation

R excels in data exploration

 

So if you’ve got a clear idea of what your data analysis pipeline needs to look like, implement it in Python. But if you want to search for patterns in data or try out different algorithms, then R is the way to go.

R in the Cloud

Any self-respecting analytics pipeline needs three things – processors, RAM and storage. If you don’t have all three, cloud computing services such as Amazon’s can set you up at very little cost – or even for free, depending on your requirements.

Big data analytics in the cloud with R

Advanced R deployment

Create interactive web applications with Shiny

Create interactive plots

Deploy interactive web apps on the cloud

For Cloud-R, check out this guide to running R on Amazon EC2

Scalable processing and storage

Consumption-based pricing

Install R Server on your local cluster

Free software from Microsoft

Runs on Hadoop, SUSE Linux or Windows

Why Cloud-R makes sense

1. Parallel processing on demand

Any respectable big data analytics pipeline needs serious grunt. If you don’t have this at hand, Amazon’s Elastic Compute Cloud (EC2) is a great way to go – get all the power you need, and only pay for what you use. Amazon has a comprehensive guide for setting up R on EC2, and it is supported by popular R graphical interfaces such as Shiny and RStudio. An EC2 t2.2xlarge instance will plough through your data, and will often work out cheaper than a cup of coffee.

2. Memory on demand

Plenty of RAM is essential – R isn’t naturally conservative with memory. Sure, you can work around memory limitations, but one of R’s strengths is its powerful set of matrix operations, which tend to be memory-intensive. Again, cloud computing can be a far cheaper alternative to upgrading your hardware – the Amazon EC2 t2.2xlarge boasts a whopping 32 GB of RAM – plenty for parallel processing of a massive data set.

3. Storage on demand

Big data applications typically need storage in the hundreds of gigabytes. Amazon’s EC2 offers various economical storage plans – or, for even less, you can use an Amazon S3 bucket, although setting it up to talk to EC2 requires an extra step.

 

Managing big data storage with R

Storage of large data sets has as much to do with the underlying hardware and file system as it does with your analysis pipeline. For fast access to petabyte-scale data, you’ll probably need a distributed system such as Hadoop, which R Server can run on top of. But for many applications, regular file storage will suffice.

 

In many instances, R’s native save() function will store information efficiently enough – provided there’s not too much of it. For large data sets, it is often more manageable to save multiple files in a directory using sequential saveRDS() calls. saveRDS() is preferable to save() because you don’t have to remember the name of the stored object – just load it back into memory using myObject <- readRDS(filename).
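
As a minimal sketch (the chunk size, directory name and my_big_df object are placeholders), writing a large table out in pieces might look like this:

# Hypothetical example: write a large data frame to disk in 100,000-row
# chunks, one .rds file per chunk, so each piece can be reloaded independently.
dir.create("my_data_dir", showWarnings = FALSE)
chunk_size <- 100000
starts <- seq(1, nrow(my_big_df), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(my_big_df))
  saveRDS(my_big_df[rows, ], file.path("my_data_dir", sprintf("chunk_%03d.rds", i)))
}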

 

You can parallel-process an entire directory of data like this:
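
A minimal sketch, assuming the chunks were written with saveRDS() into a directory called my_data_dir (the directory name and core count are placeholders):

library(parallel)

# Read every .rds file in the directory across four cores; the result is a
# list with one element per file, ready to be summarised or combined.
files <- list.files("my_data_dir", pattern = "\\.rds$", full.names = TRUE)
chunks <- mclapply(files, readRDS, mc.cores = 4)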


Alternatives to file-based object storage in R

Database storage – R can interface with standard database servers using the DBI package.

Disk-based array storage – For storing arrays of data in a common format, the RNetCDF package offers a flexible, platform-independent solution.

Compressed, tabular data – the base functions read.table() and read.csv() both read compressed tab- or comma-separated files directly (see the sketch below).
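
As a rough sketch of the database and compressed-file options (the file names and query are placeholders):

library(DBI)
library(RSQLite)

# Read a gzip-compressed CSV straight from disk – read.csv() decompresses
# the file transparently through R's connection machinery.
prices <- read.csv("prices.csv.gz")

# Query a table held in an SQLite database via the DBI interface.
con <- dbConnect(RSQLite::SQLite(), "market_data.sqlite")
recent <- dbGetQuery(con, "SELECT * FROM prices WHERE year >= 2016")
dbDisconnect(con)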

Accessing big data repositories using R

R packages are available that can interface with diverse data sources – here are some favourites:

Quandl

  • Free historical financial data
  • Dedicated R package for downloading data
  • Extensive API
  • Easy interface to Yahoo Finance
  • Extensive catalog of free and premium data sets
  • Sign up for free and get up to 50,000 API calls per day
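
As a quick illustration (the API key and dataset code are placeholders – substitute your own), pulling a series into R takes only a couple of lines:

library(Quandl)

# Authenticate once per session, then download a data set by its Quandl code.
Quandl.api_key("YOUR_API_KEY")
oil <- Quandl("OPEC/ORB", start_date = "2015-01-01")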

Google Trends

  • Track keyword popularity with Google Trends
  • R package available: gtrendsR
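
A brief sketch (the keyword and time window are arbitrary examples):

library(gtrendsR)

# Fetch 12 months of relative search interest for a keyword and plot it.
trend <- gtrends(keyword = "bitcoin", time = "today 12-m")
plot(trend)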

Google Maps

Online shopping

  • Amazon and eBay both support extensive product data APIs
  • Get global product price and specification data

Social media

Biological sequence data

Bioconductor is the gateway to much of R’s genome data

Eight tips for processing big data with R

  1. Work with vectors where possible.
  2. If you have big arrays of data, split them into smaller arrays or a list of vectors.
  3. Use xts or zoo to clean up and process time series data in small batches, then concatenate into a list of vectors when you need to put it back together.
  4. Combining many xts/zoo columns with do.call(cbind, ...) can crash – consider generating a list of intermediate tables and then binding those together.
  5. Parallel process with mclapply – and don’t reference globals, because doing so increases the memory overhead of each spawned process.
  6. If mclapply is crashing, giving errors, or silently returning NULL, try these mclapply survival tips.
  7. rbindlist from data.table is your friend – use it instead of rbind (see the sketch after this list).
  8. Clean up unneeded objects with rm(). Sometimes gc() also helps to keep things tidy.
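
A minimal sketch of tips 5 and 7 working together (the directory, column names and core count are placeholders):

library(parallel)
library(data.table)

# Summarise each chunk inside its own worker process – nothing from the
# global environment is referenced – then stack the results with rbindlist.
files <- list.files("data_chunks", pattern = "\\.rds$", full.names = TRUE)
pieces <- mclapply(files, function(f) {
  chunk <- as.data.table(readRDS(f))
  chunk[, .(mean_price = mean(price, na.rm = TRUE)), by = symbol]
}, mc.cores = 4)
combined <- rbindlist(pieces)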

 

Scrub it with R: 7 tips for cleaning big data

How to handle spikes, jitters and missing data in financial/time series data

Missing or error-laden data can lead to false inferences, and such problems are only amplified as the scale of the data increases. NA values – missing values – can crop up in stock or commodity prices when a market is closed for the day. Fortunately, the zoo and tsoutliers packages for R have a number of handy functions to make your life easier. And if these don’t help, we’ve included a link at the end to a simple spike scrubber function.

1. na.fill()

na.fill() is the simplest method for handling missing data: it replaces all NAs with a fixed value of your choice. This approach often creates more problems than it solves, however.
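
For example (the fill value is chosen arbitrarily):

library(zoo)

# Replace every NA in the series with zero.
data <- na.fill(data, fill = 0)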

2. na.locf()

Alternatively, use the “feed forward” strategy – replace NAs with prices from the previous day’s trading. The na.locf function from the zoo package will perform either feed-forward or feed-back filling on a time series object:

data <- na.locf(data)                   # observations are carried forward
data <- na.locf(data, fromLast = TRUE)  # observations are carried backwards

Unfortunately, na.locf tends to run slowly when processing very large data sets, because it generates a full copy of the data in memory. Memory requirements can be reduced by running na.locf on segments of the array, as sketched below.
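
One way to do this, assuming data is a wide zoo object or matrix (the block size is a placeholder), is to fill the columns in blocks so each in-memory copy stays small:

library(zoo)

# Fill 100 columns at a time rather than copying the whole object at once.
block_size <- 100
col_blocks <- split(seq_len(ncol(data)), ceiling(seq_len(ncol(data)) / block_size))
for (cols in col_blocks) {
  data[, cols] <- na.locf(data[, cols], na.rm = FALSE)
}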

3. na.spline()

Spline interpolation offers a more sophisticated alternative to the feed-forward approach for handling missing values. Rather than simply duplicating the preceding observation, spline interpolation uses a cubic function to predict the missing value. The na.spline function from the zoo package fits a curve through the known values, allowing reasonable estimates of the missing ones. However, performance is slow compared with na.fill or na.locf.

4. na.approx()

Similar to na.spline, na.approx attempts to estimate the missing values – but the predictive function is linear rather than cubic. It is ideal for log-transformed data.
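
The two interpolators are called the same way (shown here on a hypothetical zoo series):

library(zoo)

filled_cubic  <- na.spline(data)   # cubic spline through the known observations
filled_linear <- na.approx(data)   # straight lines between neighbouring observations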

5. rollapply()

Jittery data isn’t always a problem. If the jitters are real, they might be telling you something. But if a technical artifact is giving you the jitters, you may choose to smooth out your data using functions such as rollmean() or rollmedian() from zoo. Alternatively, you can “roll your own” with zoo’s more general rollapply(), which passes options such as na.rm through to the applied function so missing values can be ignored.
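
For example, a centred rolling median (the window width is arbitrary and prices is a placeholder series):

library(zoo)

# Seven-point rolling median that skips NAs and pads the ends so the
# output has the same length as the input.
smoothed <- rollapply(prices, width = 7, FUN = median, na.rm = TRUE,
                      fill = NA, align = "center")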

6. tsoutliers package

Spikes can occur in data for many reasons – sometimes it makes sense to remove them. But cleaning up spikes or jitters in your data can be complicated, because there’s no magical cutoff between real spikes and erroneous ones. The tsoutliers package for R provides flexible routines for detecting outliers in time series data. It does this by fitting a time series model to the data and identifying observations that don’t fit. Using this procedure will require some knowledge of the intricacies of time series models.

7. Simple function for scrubbing spikes in R

If the above packages can’t handle your big data cleaning requirements, you might find this simple spike-scrubbing function useful.
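
As a rough illustration of what such a function might look like – a hypothetical sketch, not the function linked above – one approach is to flag points that sit too many rolling MADs away from a rolling median, blank them out, and interpolate over the gaps:

library(zoo)

# Hypothetical spike scrubber: points more than 'threshold' rolling MADs
# from the rolling median are treated as spikes, set to NA, then
# linearly interpolated.
scrub_spikes <- function(x, width = 21, threshold = 6) {
  centre <- rollapply(x, width, median, na.rm = TRUE, fill = NA, align = "center")
  spread <- rollapply(x, width, mad,    na.rm = TRUE, fill = NA, align = "center")
  spike  <- !is.na(centre) & !is.na(spread) & spread > 0 &
            abs(x - centre) > threshold * spread
  x[spike] <- NA
  na.approx(x, na.rm = FALSE)
}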

Machine learning

Having cleaned up your data, it’s time to start on the analytics. Machine learning algorithms are ideal for analysing patterns and trends in big data as they make few assumptions about the underlying distribution.

R offers a diverse array of packages for pattern detection, modeling and predictive learning. Some are more polished than others, some will perform better than others when applied to very large data sets, and some will work “out of the box” on certain types of data but not others.

Getting started with machine learning

Machine learning algorithms expect a data set to “train” from. Typically, the training data comprises a set of variables (or “features”) in columns, with each row corresponding to an observation. Each variable/column represents a different category of observation.  One column – the “response” variable – should contain the data that you want the model to predict.


Example: Modelling child behavior after drinking colored beverages (ELI5). In this example, kids were given different colored drinks, and multiple types of measurement were taken for each kid – age, weight, height, beverage, and number of tantrums in a day. Each type of measurement corresponds to a variable/column in our model. Can you guess which is the response variable? That’s right: the number of tantrums. The remaining columns are termed “predictor” variables – we want to use these to predict what the child’s response will be. The different types of measurement can help to reduce confounding effects. In this example we’d need a lot more observations (kids) before we could draw solid conclusions.
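
A toy sketch of what the training table might look like in R – the column names follow the example above, and the values are generated at random purely to show the layout:

# One row per child; 'tantrums' is the response, the rest are predictors.
set.seed(1)
n <- 40
kids <- data.frame(
  age      = sample(3:7, n, replace = TRUE),
  weight   = round(rnorm(n, mean = 20, sd = 3), 1),
  height   = round(rnorm(n, mean = 110, sd = 8), 1),
  beverage = factor(sample(c("red", "green", "blue"), n, replace = TRUE)),
  tantrums = rpois(n, lambda = 2)
)
head(kids)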


In other applications, we might not be interested in which predictors are important so long as the group as a whole is predictive. That’s fine, but bear in mind that with more variables comes more noise. So choose your predictors wisely – if you can.

Popular machine learning packages for R

 

randomForest

Like it says on the box, randomForest implements a random forest machine learning algorithm. Data is modeled by growing and pruning forests of decision trees. In my experience, this is one of the faster and more sensitive machine learning algorithms.

  • Easy to get started with
  • Lots of plotting and diagnostic capabilities.
  • Internal cross-validation: automatic assessment of model performance.
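
A minimal sketch using the toy kids table sketched earlier (the formula and settings are illustrative):

library(randomForest)

# Predict tantrum counts from all other columns; the out-of-bag error
# reported with the fit is the package's built-in cross-validation.
fit <- randomForest(tantrums ~ ., data = kids, ntree = 500, importance = TRUE)
print(fit)
varImpPlot(fit)   # which predictors carry the most weight?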

 

nnet

nnet implements a neural network with a “hidden layer” of nodes between the input and output layers. The hidden layer increases the complexity of the network, enabling non-linear pattern detection.
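
A hedged sketch on the same toy table (the hidden-layer size and other settings are arbitrary):

library(nnet)

# A small single-hidden-layer network; linout = TRUE gives a linear output
# unit, which suits a numeric response such as tantrum counts.
net <- nnet(tantrums ~ ., data = kids, size = 5, linout = TRUE,
            decay = 0.01, maxit = 500)
predict(net, head(kids))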

 

caret

The caret package provides a convenient gateway to tuning and testing a range of machine learning algorithms. It offers a single function – train() – that you plug your data into.

  • Evolving repertoire of machine learning algorithms.
  • Supports parallel processing
  • Easy parameter tuning
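
For instance, a sketch that hands the same toy problem to caret (the method and resampling choices are arbitrary):

library(caret)

# train() wraps model fitting, resampling and parameter tuning behind one call;
# here it tunes a random forest with 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)
tuned <- train(tantrums ~ ., data = kids, method = "rf", trControl = ctrl)
print(tuned)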

 

kernlab

kernlab is a suite of machine learning tools that caters for both supervised and unsupervised classification problems.

  • Support vector machine algorithm for supervised classification
  • Spectral clustering algorithm for unsupervised clustering
  • Other more exotic machine learning methods:
    • Gaussian processes
    • Kernel principal component analysis
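
A brief sketch of the supervised case on the toy table (the kernel choice is arbitrary):

library(kernlab)

# Support vector machine classifying beverage colour from the numeric predictors.
svm_fit <- ksvm(beverage ~ age + weight + height, data = kids, kernel = "rbfdot")
predict(svm_fit, head(kids))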


Well that’s it for our big data mega R overview – phew! Hopefully you can find some pointers here to get you started. Did we get something wrong or leave out your favorite package? Hit us up in the comments below!