
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

October 17, 2017

Cloud Avenue Hadoop Tips

Microsoft Azure for AWS Professionals

A few months back I blogged about 'GCP for AWS Professionals' comparing the two platforms here. Now, Microsoft has published something similar comparing Microsoft Azure with Amazon AWS here.

It says something about Amazon AWS when competitors compare their services against Amazon's. AWS has been adding new services, and features (small and big) within them, at a very rapid pace. Here you can track the new features introduced in Amazon AWS on a daily basis.

Similar to GCP and AWS, Azure also gives free credit to get started. Now is the time to create an account and get started with the cloud. Here are the links for the same (AWS, Azure and GCP).
Cloud Avenue Hadoop Tips

Getting the execution times of the EMR Steps from the CLI

In the previous blog, we executed a Hive script to convert the Airline dataset from the original csv format to the Parquet Snappy format. Then the same query was run on the csv and the Parquet Snappy data to see the performance improvement. This involved three steps.

Step 1 : Create the ontime and the ontime_parquet_snappy tables. Move the data from the ontime table to the ontime_parquet_snappy table to convert it from one format to the other.

Step 2 : Execute the query on the ontime table, which represents the csv data.

Step 3 : Execute the query on the ontime_parquet_snappy table, which represents the Parquet Snappy data.

The execution times for the above three steps were obtained from the AWS EMR management console, which is a Web UI. All the tasks which can be done from the AWS management console can also be done from the CLI (Command Line Interface). Let's see the steps involved in getting the execution times for the EMR steps.

Step 1 : Install the AWS CLI for the appropriate OS. Here are the instructions for the same.

Step 2 : Generate the Security Credentials. These are used to make calls from the SDK and CLI. More about Security Credentials here and how to generate them here.

Step 3 : Configure the AWS CLI by specifying the Security Credentials and the Region using the 'aws configure' command. More details here.
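For reference, running 'aws configure' prompts for the four values interactively as shown below; the key values are placeholders and the region and output format are just examples.
aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: json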

Step 4 : From the prompt execute the below command to get the cluster-id of the EMR cluster.
aws emr list-clusters --query 'Clusters[*].{Id:Id}'

Step 5 : For the above cluster-id get the step-id by executing the below command.
aws emr list-steps --cluster-id j-1WNWN0K81WR11 --query 'Steps[*].{Id:Id}'

Step 6 : For one of the above step-ids, get the start and the end time and hence the execution time for the step.
aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP --query 'Step.{StartTime:Status.Timeline.StartDateTime,EndTime:Status.Timeline.EndDateTime}'

The above commands might look a bit cryptic, but it's easy once you get started. The documentation for the same is here. As you might have noticed, I have created an Ubuntu virtual machine on top of Windows and am executing the commands in Ubuntu.
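As a side note, the start and end times for all the steps can be pulled in a single call by extending the list-steps query with the same timeline fields used above; the below is a sketch using the same cluster-id and a table output for readability.
aws emr list-steps --cluster-id j-1WNWN0K81WR11 --query 'Steps[*].{Id:Id,Name:Name,Start:Status.Timeline.StartDateTime,End:Status.Timeline.EndDateTime}' --output table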
Cloud Avenue Hadoop Tips

Different ways of executing the Big Data processing jobs in EMR

There are different ways of kick-starting a Hive/Pig/MR/Spark job on Amazon EMR. We already looked at how to submit a Hive job or a step from the AWS EMR management console here. This approach is cool, but doesn't have much scope for automation.

Here are the other ways to start the Big Data Processing with some level of automation.

1) Use Apache Oozie to create a workflow and a coordinator.
2) Use the AWS CLI
3) Login to the master instance and use the Hive shell

In the above, Option 1 is a bit complicated and will be explored in another blog. Here we will be looking at the other two options.

Option 2 : Using the AWS CLI

Step 1 : Create airline.sql with the below content. It will create a table in Hive and map it to the data in S3. To get the data into S3, follow this article. Then a query will be run on the table.
create external table ontime_parquet_snappy (
Year INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime INT,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Dest STRING,
Distance INT,
TaxiIn INT,
TaxiOut INT,
Cancelled INT,
CancellationCode STRING,
Diverted STRING,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin; 
Step 2 : Put the above file into the master node using the below command.
aws emr put --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --src "/home/praveen/Desktop/airline.sql"
Don't forget to replace the cluster-id, the path of the key-pair and the sql file in the above command.

Step 3 : Kick start the Hive program using the below command.
aws emr ssh --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --command "hive -f airline.sql"
Replace the cluster-id and the key-pair path in the above command.

Step 4 : The final step is to monitor the progress of the Hive job and verify the output in the S3 management console.



Option 3 : Login to the master instance and use the Hive shell

Step 1 : Delete the output of the Hive query which was created in the above option.

Step 2 : Follow the steps mentioned here to ssh into the master.

Step 3 : Start the Hive shell using the 'hive' command and create a table in Hive as shown below.


Step 4 : Check if the table has been created or not as shown below using the show and the describe SQL commands.


Step 5 : Execute the Hive query in the shell and wait for it to complete.



Step 6 : Verify the output of the Hive job in S3 management console.

Step 7 : Forward the local port to the remote port as mentioned here and access the YARN console to see the status of the Hive job.
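For reference, the AWS documentation describes an SSH tunnel using local port forwarding along the below lines; the key-pair file is the one used earlier and the master public DNS is a placeholder that has to be replaced. The YARN ResourceManager UI can then be opened at http://localhost:8157 in the local browser.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -N -L 8157:<master-public-dns>:8088 hadoop@<master-public-dns>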


This completes the steps for submitting a Hive job in different ways. The same steps can be repeated with minimal changes for Pig, Sqoop and other Big Data software as well.
Cloud Avenue Hadoop Tips

EMR logging into the master instance

Once we spawn a cluster as mentioned here, we should see the instances in the EC2 management console. It would be nice to log in to the master instance. All the log files are generated on the master and then moved to S3. Also, the different Big Data processing jobs can be run from the master's command line interface.

In this blog we will look into connecting to the master. The AWS documentation for the same is here.

Step 1 : Click on the gear button on the top right. The columns in the page can be added or deleted here.


Include the EMR related keys as shown on the right of the above screenshot and the EC2 instance roles (MASTER and CORE) will be displayed as shown below.


Get the DNS hostname of the master instance after selecting it.

Step 2 : From the same EC2 management console, modify the Security Group associated with the master instance to allow inbound port 22 as shown below.




Step 3 : Now ssh into the master as shown below. Note that the DNS name of the master has to be changed.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem hadoop@ec2-54-147-238-2.compute-1.amazonaws.com

Step 4 : Go to '/mnt/var/log/' and check the different log files.

In the upcoming blog, we will explore running a Hive script from the master itself once we have logged into it.
 

October 16, 2017


Revolution Analytics

My interview with ROpenSci

The ROpenSci team has started publishing a new series of interviews with the goal of “demystifying the creative and development processes of R community members”. I had the great pleasure of being...

...

Revolution Analytics

An AI pitches startup ideas

Take a look at this list of 13 hot startups, from a list compiled by Alex Bresler. Perhaps one of them is the next Juicero? FAR ATHERA: A CLINICAL AI PLATFORM THAT CAN BE ACCESSED ON DEMAND. ZAPSY:...

...
Cloud Avenue Hadoop Tips

How to get familiar with the different AWS Services quickly?

AWS has got a lot of services and they are introducing new services, and a lot of features within them, at a very quick pace. It's a difficult task to keep pace with them. New features (small and big) are introduced almost daily. Here is a blog to stay updated on the latest services and features in AWS. In that blog, you will notice that almost every day there is something new.

The AWS documentation comes with 'Getting Started' guides/tutorials which, as the name says, help you get started with the different AWS services quickly without going into too much detail. For those who want to become an AWS Architect, an understanding of the different AWS services is quite essential, and these 'Getting Started' guides/tutorials are helpful for the same.

The 'Getting Started' guides/tutorials for the different AWS services follow different URL patterns and so are difficult to track down. A quick Google search with the below URL will find the AWS 'Getting Started' guides/tutorials in the AWS documentation for the different services. Click on Next in the search page to get a few more of them.


https://www.google.co.in/search?q=getting+started+site%3Aaws.amazon.com

Again, I would strongly recommend going through the above 'Getting Started' guides/tutorials for the wannabe AWS Architects. Hope it helps.
 

October 15, 2017

Principa

Machine Learning is here to stay

We have a bit of a joke in the office around how data scientists in 2027 will have a good laugh at what we define as ‘Big Data’ in 2017.  Pat pat, there there, I guess that was Big Data back then.  Unlike the term Big Data, Machine Learning is here to stay.  It is after all one of the foundations of Artificial Intelligence and this is rapidly becoming more and more part of our culture.  The impact of Machine Learning is being felt on a daily basis, from using interactive devices like Amazon’s Echo to do our shopping, learning a language through DuoLingo, or interacting with chatbots to get your statement in under a second instead of waiting “for the next available agent”.  So what has happened, why the recent explosion of Machine Learning applications?


Simplified Analytics

Digital in the Age of Customer !!

Today’s customers are like a King. They are in the driver’s seat, with the power of the internet and social media tools creating tons of information. With the strong influence of smartphones, all the...

...
Cloud Avenue Hadoop Tips

aws-summary : Different ways of replicating data in S3

S3 has different storage classes with different durability and availability as mentioned here. S3 provides very high durability and availability, but if more is required then CRR (Cross Region Replication) can be used. CRR, as the name suggests, replicates the data automatically across buckets in two different regions to provide even more durability and availability.
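As a rough sketch of what enabling CRR involves from the CLI (the bucket names, account id and role name are placeholders, the destination bucket has to be created in a different region, and versioning must be enabled on both buckets):
aws s3api put-bucket-versioning --bucket <source-bucket> --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket <destination-bucket> --versioning-configuration Status=Enabled
aws s3api put-bucket-replication --bucket <source-bucket> --replication-configuration '{"Role":"arn:aws:iam::<account-id>:role/<replication-role>","Rules":[{"Status":"Enabled","Prefix":"","Destination":{"Bucket":"arn:aws:s3:::<destination-bucket>"}}]}'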


Here are a few resources around CRR.

About and using CRR

Monitoring the CRR

The below two approaches are for replicating S3 data within the same region, which CRR doesn't support.

Content Replication Using AWS Lambda and Amazon S3

Synchronizing Amazon S3 Buckets Using AWS Step Functions

Nothing new here, just a way of consolidating resources around the replication of S3 data. AWS has a lot of good resources, both to get started and on advanced topics, but they are dispersed and difficult to find. I will be updating this blog as I find new resources around this topic and also based on the comments on this blog.

Also, the above articles/blogs use multiple services from AWS, so it would be a nice way to get familiar with them.
 

October 14, 2017


Forrester Blogs

Basic Infrastructure Automation is a No-Brainer – Why Aren’t You Doing It?

Everyone knows they need to automate. That is not a thought-provoking statement. In fact, when we ask infrastructure decision leaders their top priorities, automation shows up at #3, just behind...

...
 

October 13, 2017


Revolution Analytics

Because it's Friday: Line Rider

Line Rider is a simple web-based game: draw a line (or a series of lines), and watch an animated sledder ride along it like it was a snow slope. It's remained much the same since it was created in...

...

Forrester Blogs

Help Wanted: Data Innovation For The Data Economy

Thomas Edison once said, “The value of an idea is in the using of it.”  Today (many of) those “ideas” are data and the insights derived from them, and it remains true that their value is in how they...

...
 

October 12, 2017

Silicon Valley Data Science

Taming Busyness

I’ve been in the “working world” for almost 30 years, and one thing that is common to every job I’ve ever had is this: time will control if not controlled.

At SVDS, our agile nature means we need to iterate quickly and manage our time accordingly. There are many emotions involved with time management—worry, annoyance, relief—but it’s easier to achieve success if you act more tactically. In this post I share some tips for avoiding calendar panic, learned from years as an executive administrator.

Look to the future

I can’t overemphasize the importance of looking ahead. I have a system that includes looking to the upcoming month on the calendars I manage and building in pockets for project time, travel, and personal needs. Next, every Friday afternoon I look ahead to the upcoming week and identify any overlaps, problems or inconsistencies. Finally, at the end of each day I look to the next day to make sure what’s on “paper” makes sense. Of course this varies person to person and the different calendars I manage certainly reflect the personality of the owner. But time looking ahead is time well invested.

What’s important?

Prioritize, prioritize, prioritize. Everyone has some method of prioritization, but not everyone’s calendars reflect those priorities. I look at calendars as fluid tools to reflect the priorities of the day (which speak to the priorities of the week, month, year, etc). There is no shame in moving, rescheduling, or canceling things that don’t push you toward what is important. I’ve learned so much from leaders I support in watching how they move through their demands.

Be flexible

In that vein, flexibility is your friend. Calendars are a tool which allow us to manage the unmanageable. While time can’t be slowed or hastened, it can be leveraged to work in your favor. I’ve never met a person who is more productive while in the throes of frustration— deadlines, yes; pressure, yes; but frustration, no. Frustration is many times the result of trying to stick to something that isn’t working.

Be flexible with your calendar. In the workplace, you don’t want to be the one who doesn’t respect calendar appointments, but great leaders foster like-minded priorities within an organization and this underscores the need for flexibility when appropriate.

Busy vs productive

Finally, there is a distinct difference between being busy and being productive. Productivity trumps busyness every time, and my favorite rule in winning the productivity race is the well known “touch it once” principle. Applying this one methodology can make a significant impact in creating space within your time. Even when you must re-visit something, the simple awareness of the fact that this is repetitive is valuable.

Conclusion

These are just a few tips that can help you manage a busy calendar, and an unending todo list. I’ve included a couple links below to more resources. Everyone responds to pressure a little differently—share your time management tips in the comments.

The post Taming Busyness appeared first on Silicon Valley Data Science.


Revolution Analytics

A cRyptic crossword with an R twist

Last week's R-themed crossword from R-Ladies DC was popular, so here's another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known...

...
Cloud Avenue Hadoop Tips

Is it on AWS?

I am really fascinated by the Serverless model. There is no need to think in terms of servers, no need to scale, and it's cheap when compared to server hosting, etc. I agree there are some issues, like the cold start of the containers when the Lambda, or in fact any serverless function, is invoked after a long time.

I came across this 'Is it on AWS?' a few months back and bookmarked it, but didn't get a chance to try it out. It uses a Lambda function to tell if a particular domain name or ip address is in the published list of AWS IP address ranges. The site has links to a blog and code for the same, so I would not like to expand on it here.

Have fun reading the blog !!!!

Forrester Blogs

Learn About The Power Of Live Video With Experts From Experian, Procore, and Performics – Forrester Panel Webinar

If you’ve applied for a mortgage or bought a car, it’s likely that your lender checked your credit through Experian. The company has extensive expertise in how to build and maintain good...

...
Cloud Avenue Hadoop Tips

Creating an Application Load Balancer and querying its logging data using Athena

When building a highly scalable website like amazon.com, there would be thousands of web servers, all of them fronted by multiple load balancers as shown below. The end user points to the load balancer, which forwards the requests to the web servers. In the case of the AWS ELB (Elastic Load Balancer), the distribution of the traffic from the load balancer to the servers is done in a round-robin fashion and doesn't consider the size of the servers or how busy/idle they are. Maybe AWS will add this feature in an upcoming release.


In this blog, we would be analyzing the number of users coming to a website from different ip addresses. Here are the steps at a high level, which we would be exploring in a bit more detail. This is again a lengthy post where we would be using a few AWS services (ELB, EC2, S3 and Athena) and seeing how they work together.
   
    - Create two Linux EC2 instance with web servers with different content
    - Create an Application Load Balancer and forward the requests to the above web servers
    - Enable the logging on the Application Load Balancer to S3
    - Analyze the logging data using Athena

To continue further, the following can be done (not covered in this article)

    - Create a Lambda function to call the Athena query at regular intervals
    - Auto Scale the EC2 instances depending on the resource utilization
    - Remove the Load Balancer data from s3 after a certain duration

Step 1: Create two Linux instances and install web servers as mentioned in this blog. In the /var/www/html folder have the files as mentioned below. Ports 22 and 80 have to be opened for accessing the instance through ssh and for accessing the web pages in the browser.

     server1 - index.html
     server2 - index.html and img/someimage.png

Make sure that ip-server1, ip-server2 and ip-server2/img/someimage.png are accessible from the web browser. Note that the image should be present in the img folder. The index.html is for serving the web pages and also for the health check, while the image is for testing the path-based routing to the img folder.
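As a rough sketch of the web server setup on each instance (assuming Amazon Linux with Apache; the page content and image name are placeholders):
sudo yum -y install httpd
sudo service httpd start
echo "server1" | sudo tee /var/www/html/index.html
# on server2, additionally create the img folder and copy an image into it
sudo mkdir -p /var/www/html/img
sudo cp someimage.png /var/www/html/img/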



Step 2: Create the Target Group.



Step 3: Attach the EC2 instances to the Target Group.




Step 4: Change the Target Group's health checks. This will make the instances healthy faster.


Step 5: Create the second Target Group. Associate server2 with the target-group2 as mentioned in the flow diagram.


Step 6: Now is the time to create the Application Load Balancer. This balancer is relatively new when compared to the Classic Load Balancer. Here is the difference between the different Load Balancers. The Application Load Balancer operates at layer 7 of the OSI model and supports host-based and path-based routing. Any web requests with the '/img/*' pattern would be sent to target-group2; the rest by default would be sent to target-group1 after completing the below settings.







Step 7: Associate the target-group1 with the Load Balancer, the target-group2 will be associated later.





Step 8: Enable access logs on the Load Balancer by editing the attributes. The specified S3 bucket for storing the logs will be automatically created.



A few minutes after the Load Balancer has been created, the instances should turn into a healthy state as shown below. If not, then maybe one of the above steps has been missed.


Step 9: Get the DNS name of the Load Balancer and open it in the browser to make sure that the Load Balancer is working.


Step 10: Now is the time to associate the second Target Group (target-group2). Click on 'View/edit rules' and add a rule.





Any requests with the path /img/* would be sent to target-group2; the rest of them would be forwarded to target-group1.
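The same rule can also be added from the CLI along the below lines (a sketch; the listener and target group ARNs are placeholders and the priority is arbitrary):
aws elbv2 create-rule --listener-arn <listener-arn> --priority 10 --conditions Field=path-pattern,Values='/img/*' --actions Type=forward,TargetGroupArn=<target-group2-arn>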



Step 11: Once the Load Balancer has been accessed from different browsers a couple of times, the log files should be generated in S3 as shown below.


Step 12: Now it's time to create a table in Athena, map it to the data in S3 and query it. The DDL and the DML commands for Athena can be found here.
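As a sketch of the kind of query involved, once a table over the access logs has been created (say alb_logs with a client_ip column, following the linked DDL), the per-ip request counts can be pulled from the Athena console or from the CLI as below; the table name, column name and results bucket are assumptions on my part.
aws athena start-query-execution --query-string "SELECT client_ip, count(*) AS requests FROM alb_logs GROUP BY client_ip ORDER BY requests DESC" --result-configuration OutputLocation=s3://<athena-results-bucket>/alb/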




We have seen how to create a Load Balancer, associate Linux web servers with it and finally query the log data with Athena. Make sure that all the AWS resources which have been created are deleted to stop the billing for them.

That's it for now.
Cloud Avenue Hadoop Tips

AWS SES for sending emails using Java

We had been using emails for marketing some of the product offerings we have. Instead of setting up our own email server for sending emails, we had been using AWS SES for the same. In this blog, I would be posting the code (Java) and the configuration files to send emails in bulk using AWS SES. Sending emails through SES is quite cheap; more details about the pricing here.

Here I am going with the assumption that the readers of this blog are a bit familiar with Java, AWS and Maven. They should also have an account with AWS. If you don't have an AWS account, here are the steps.

Step 1: Login to the AWS SES Management Console and get the email address verified. This should be the same email address from which the emails will be sent. The AWS documentation for the same is here.


Step 2: Go to the AWS SQS Management Console and create a Queue. All the default settings should be good enough.


Step 3: Go to the AWS SNS Management Console and create a Topic. For this tutorial 'SESNotifications' topic has been created.


Step 4: Select the topic which was created in the previous step and go to 'Actions -> Subscribe to topic'. The SQS Queue Endpoint (ARN) can be obtained from the previous step.


Step 5: Go to the SES Management Console and create a Configuration Set as shown below. More about the Configuration Sets here and here.


With the above configuration, when an email is sent, any Bounce, Click, Complaint or Open events will be sent to the SNS topic and from there they will go to the SQS Queue. This completes the steps to be done in the AWS Management Console. Now we will look at the code and configuration files for sending the emails. The below files are required for sending the emails.

File 1 : SendEmailViaSES.java - Uses the AWS Java SDK API to send emails.
package com.thecloudavenue;

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.util.Properties;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.Velocity;

import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.simpleemail.AmazonSimpleEmailServiceClient;
import com.amazonaws.services.simpleemail.model.Body;
import com.amazonaws.services.simpleemail.model.Content;
import com.amazonaws.services.simpleemail.model.Destination;
import com.amazonaws.services.simpleemail.model.Message;
import com.amazonaws.services.simpleemail.model.SendEmailRequest;

public class SendEmailViaSES {

public static void main(String[] args) {

System.out.println("Attempting to send emails through Amazon SES .....");

try {

if (args.length == 0) {
System.out.println(
"Proper Usage is: java -cp jar_path com.thecloudavenue.SendEmailViaSES config.properties_path");
System.exit(0);
}

File f = new File(args[0]);
if (!f.isFile()) {
System.out.println(
"Please make sure that the config.properties is specified properly in the command line");
System.exit(0);
}

System.out.println("\n\nLoading the config.properties file .....");
Properties prop = new Properties();
InputStream input = null;
input = new FileInputStream(args[0]);
prop.load(input);

String template_file_path = new String(prop.getProperty("template_file_path"));
String template_file_name = new String(prop.getProperty("template_file_name"));
f = new File(template_file_path + "\\\\" + template_file_name);
if (!f.isFile()) {
System.out.println(
"Please make sure that the template_file_path and template_file_name are set proper in the config.properties");
System.exit(0);
}

String email_db_path = new String(prop.getProperty("email_db_path"));
f = new File(email_db_path);
if (!f.isFile()) {
System.out.println("Please make sure that the email_db_path is set proper in the config.properties");
System.exit(0);
}

String from_address = new String(prop.getProperty("from_address"));
String email_subject = new String(prop.getProperty("email_subject"));
Long sleep_in_milliseconds = new Long(prop.getProperty("sleep_in_milliseconds"));
String ses_configuration_set = new String(prop.getProperty("ses_configuration_set"));

System.out.println("Setting the Velocity to read the template using the absolute path .....");
Properties p = new Properties();
p.setProperty("file.resource.loader.path", template_file_path);
Velocity.init(p);

AmazonSimpleEmailServiceClient client = new AmazonSimpleEmailServiceClient();
Region REGION = Region.getRegion(Regions.US_EAST_1);
client.setRegion(REGION);

System.out.println("Getting the Velocity Template file .....");
VelocityContext context = new VelocityContext();
Template t = Velocity.getTemplate(template_file_name);

System.out.println("Reading the email db file .....\n\n");
FileInputStream fstream = new FileInputStream(email_db_path);
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String strLine, destination_email, name;

int count = 0;

while ((strLine = br.readLine()) != null) {

count++;

// extract the email from the line
StringTokenizer st = new StringTokenizer(strLine, ",");
destination_email = st.nextElement().toString();

// Check if the email is valid or not
Pattern ptr = Pattern.compile(
"(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ 
\\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*(?:(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ 
\\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*))*)?;\\s*)");
if (!ptr.matcher(destination_email).matches()) {
System.out.println("Invalid email : " + destination_email);
continue;
}

// Figure out the name to be used in the email content
if (st.hasMoreTokens()) {
// if the email db has the name use it
name = st.nextElement().toString();
} else {
// if not then use the string before @ as the name
int index = destination_email.indexOf('@');
name = destination_email.substring(0, index);
}

Destination destination = new Destination().withToAddresses(destination_email);

// Use the velocity template to create the html
context.put("name", name);
StringWriter writer = new StringWriter();
t.merge(context, writer);

// Create the email content to be sent
Content subject = new Content().withData(email_subject);
Content textBody = new Content().withData(writer.toString());
Body body = new Body().withHtml(textBody);
Message message = new Message().withSubject(subject).withBody(body);
SendEmailRequest request = new SendEmailRequest().withSource(from_address).withDestination(destination)
.withMessage(message).withConfigurationSetName(ses_configuration_set);

// Send the email using SES
client.sendEmail(request);

System.out
.println(count + " -- Sent email to " + destination_email + " with name as " + name + ".....");

// Sleep as AWS SES puts a limit on how many email can be sent per second
Thread.sleep(sleep_in_milliseconds);

}

in.close();

System.out.println("\n\nAll the emails sent!");

} catch (Exception ex) {
System.out.println("\n\nAll the emails have not been sent. Please send the below error.");
ex.printStackTrace();
}
}
}
File 2 : GetMessagesFromSQS.java - Uses the AWS Java SDK API to get the Bounce, Click, Complaint, Open events from the SQS Queue.
package com.thecloudavenue;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.DeleteMessageRequest;
import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class GetMessagesFromSQS {

public static void main(String[] args) throws Exception {

System.out.println("Attempting to get messages from Amazon SQS .....");

try {

if (args.length == 0) {
System.out.println(
"Proper Usage is: java -cp jar_path com.thecloudavenue.GetMessagesFromSQS config.properties_path");
System.exit(0);
}

File f = new File(args[0]);
if (!f.isFile()) {
System.out.println(
"Please make sure that the config.properties is specified properly in the command line");
System.exit(0);
}

System.out.println("\n\nLoading the config.properties file .....");
Properties prop = new Properties();
InputStream input = null;
input = new FileInputStream(args[0]);
prop.load(input);

String message_output_file_path = new String(prop.getProperty("message_output_file_path"));
String sqs_queue_name = new String(prop.getProperty("sqs_queue_name"));

AmazonSQS sqs = AmazonSQSClientBuilder.standard().withRegion(Regions.US_EAST_1).build();
String myQueueUrl = sqs.getQueueUrl(sqs_queue_name).getQueueUrl();

int approximateNumberOfMessages = 0;

GetQueueAttributesResult gqaResult = sqs.getQueueAttributes(
new GetQueueAttributesRequest(myQueueUrl).withAttributeNames("ApproximateNumberOfMessages"));
if (gqaResult.getAttributes().size() == 0) {
System.out.println("Queue " + sqs_queue_name + " has no attributes");
} else {
for (String key : gqaResult.getAttributes().keySet()) {
System.out.println(String.format("\n%s = %s", key, gqaResult.getAttributes().get(key)));
approximateNumberOfMessages = Integer.parseInt(gqaResult.getAttributes().get(key));

}
}

FileWriter fstream = new FileWriter(message_output_file_path, true);
BufferedWriter out = new BufferedWriter(fstream);

ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl);
receiveMessageRequest.setMaxNumberOfMessages(10);

int pendingNumberOfMessages = approximateNumberOfMessages;

for (int i = 1; i <= approximateNumberOfMessages; i++) {

List messages = sqs.receiveMessage(receiveMessageRequest).getMessages();
int count = messages.size();
System.out.println("\ncount == " + count);

for (Message message : messages) {

System.out.println("Writing the message to the file .....");
out.write(message.getBody());

System.out.println("Deleting the message from the queue .....");
String messageRecieptHandle = message.getReceiptHandle();
sqs.deleteMessage(new DeleteMessageRequest(myQueueUrl, messageRecieptHandle));

}

pendingNumberOfMessages = pendingNumberOfMessages - count;
System.out.println("pendingNumberOfMessages = " + pendingNumberOfMessages);

if (pendingNumberOfMessages <= 0) {
break;
}
}

out.close();

System.out.println("\n\nGot all the messages into the file");

} catch (Exception ex) {
System.out.println("\n\nAll the messages have not been got from the queue. Please send the below error.");
ex.printStackTrace();
}
}
}
File 3 : emaildb.txt - List of emails. It can also take the name of the person after the comma to send a customized email. The last two emails are used to test bounce and complaints, more here.
praveen4cloud@gmail.com,praveen
praveensripati@gmail.com
bounce@simulator.amazonses.com
complaint@simulator.amazonses.com
File 4 : email.vm - The email template which has to be sent. The $name in the below email template will be replaced with the name from the above file. If the name is not there then the email id will be used in place of the $name.
<html>
<body>
Dear $name,<br/><br/>
Welcome to thecloudavenue.com <a href="http://www.thecloudavenue.com/">here</a>.<br/><br/>
Thanks,
Praveen
</body>
</html>
File 5 : config.properties - The properties to configure the Java program. Note that the paths may need to be modified depending on where the files have been placed.
email_db_path=E:\\WorkSpace\\SES\\sendemail\\resources\\emaildb.txt
message_output_file_path=E:\\WorkSpace\\SES\\sendemail\\resources\\out.txt

from_address="Praveen Sripati" <praveensripati@gmail.com>
email_subject=Exciting opportunities in Big Data
template_file_path=E:\\WorkSpace\\SES\\sendemail\\resources
template_file_name=email.vm
sleep_in_milliseconds=100

ses_configuration_set=email-project
sqs_queue_name=SESQueue
File 6 : pom.xml - The maven file with all the dependencies.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.thecloudavenue</groupId>
<artifactId>sendemail</artifactId>
<version>0.0.1-SNAPSHOT</version>
<name>sendemail</name>
<dependencies>
<dependency>
<groupId>org.apache.velocity</groupId>
<artifactId>velocity</artifactId>
<version>1.7</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk</artifactId>
<version>1.11.179</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>amazon-kinesis-client</artifactId>
<version>1.2.1</version>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.0.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
File 7 : credentials file in the profile folder. I have kept mine in the C:\Users\psripati\.aws folder under Windows. Create the access keys as mentioned here. The format of the credentials file is mentioned here. Note that it's better to create an IAM user with just the permissions to send SES emails and to read messages from the SQS queue. The access keys have to be created for this user.
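For reference, the credentials file follows the standard AWS SDK/CLI format, something like the below; the values are placeholders for the IAM user's access keys.
[default]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>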

Compile the Java files and package them as a jar file. Now let's see how to run the program.

For sending the emails, open the command prompt and run the below command. Note to replace the sendemail-0.0.1-SNAPSHOT.jar and config.properties paths with the correct ones.

java -cp E:\WorkSpace\SES\sendemail\target\sendemail-0.0.1-SNAPSHOT.jar com.thecloudavenue.SendEmailViaSES E:\WorkSpace\SES\sendemail\resources\config.properties

For getting the list of complaints, bounces, opened and clicked emails, open the command prompt and run the below command. Note to replace the sendemail-0.0.1-SNAPSHOT.jar and config.properties paths with the correct ones.

java -cp E:\WorkSpace\SES\sendemail\target\sendemail-0.0.1-SNAPSHOT.jar com.thecloudavenue.GetMessagesFromSQS E:\WorkSpace\SES\sendemail\resources\config.properties

I know that there is a sequence of steps to get started with SES, but once the whole setup has been done it should be a piece of cake to use SES for sending emails. On top of everything, it's a lot cheaper to send emails via SES than to set up an email server.

Note : This article has used Apache Velocity, which is a Java-based template engine, for sending customized emails to the recipients. AWS SES has since included this templating functionality in the SDK itself, so there is no need to use Apache Velocity. Here is a blog on the same.
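As a rough sketch of that built-in templating (the template name and addresses below are placeholders, and the configuration set is the one created earlier), a stored template can be created once and then referenced while sending:
aws ses create-template --template 'TemplateName=WelcomeTemplate,SubjectPart=Welcome,HtmlPart=Dear {{name}}'
aws ses send-templated-email --source <from-address> --destination ToAddresses=<to-address> --template WelcomeTemplate --template-data '{"name":"Praveen"}' --configuration-set-name email-project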

Forrester Blogs

Vendor Landscape: Enterprise Marketing Software Suites In Asia Pacific

More marketers in Asia Pacific (AP) are investing in an enterprise marketing software suite (EMSS) to better engage empowered customers and drive contextual marketing — but EMSS vendors’ solution...

...
 

October 11, 2017


Forrester Blogs

How To Make Smarter MarTech Investments

Forrester has covered marketing technology for 15 years, and I’ve been here for nearly 11 of them. Over the years, the MarTech landscape has exploded and what we have today is frankly an...

...

Revolution Analytics

Tutorial: Azure Data Lake analytics with R

The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the...

...

Forrester Blogs

Your Customers Want Intelligent Agents But You Must Win Trust With Security

We live in time where your car can order your morning latte from Starbucks, and Alexa can order your dinner from Domino’s. Early-stage intelligent agents, also called virtual assistants or digital...

...

Forrester Blogs

Global Tech Market Will Grow By 4% In 2018, Reaching $3 Trillion

Forrester’s has published its mid-year global tech market outlook for 2017 and 2018 (see “Midyear Global Tech Market Outlook For 2017 To 2018”). In constant currencies, we project that global...

...
Cloud Avenue Hadoop Tips

Support for other languages in AWS Lambda

According to the AWS Lambda FAQ:

Q: What languages does AWS Lambda support?

AWS Lambda supports code written in Node.js (JavaScript), Python, Java (Java 8 compatible), and C# (.NET Core). Your code can include existing libraries, even native ones. Please read our documentation on using Node.js, Python, Java, and C#.


So, I was under the impression that AWS Lambda only supports the languages mentioned in the FAQ documentation. But other languages are also supported in Lambda with some effort. Here is the AWS blog on the same. Basically, it's a Node.js wrapper for invoking a program in one of the languages which is not supported by default by AWS Lambda. There would be some overhead with this approach, but I'm not sure how much.


Cloud Avenue Hadoop Tips

Converting Airline dataset from the row format to columnar format using AWS EMR

To process Big Data, a huge number of machines is required. Instead of buying them, it's better to process the data in the Cloud as it provides lower CAPEX and OPEX costs. In this blog we will look at processing the airline dataset in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service. We don't need to worry about installing, configuring, patching or securing the Big Data software; EMR takes care of that. We just need to specify the size and the number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as this.

The Airline dataset is in a csv format, which is efficient for fetching the data in a row-wise manner based on some condition, but not really efficient when we want to do aggregations. So, we would be converting the csv data into the Parquet format and then running the same query on the csv and the Parquet data to observe the performance improvement.

Note that using AWS EMR will incur cost and doesn't fall under the AWS free tier, as we would be launching not t2.micro EC2 instances but somewhat bigger ones. I will try to keep the cost to the minimum as this is a demo. Also, I prepared the required scripts ahead of time and tested them on the local machine on small datasets instead of on AWS EMR. This will save on the AWS expenses.

So, here are the steps

Step 1 : Download the Airline dataset from here and uncompress the same. All the datasets can be downloaded and uncompressed, but to keep the cost to the minimum I downloaded the 1987, 1989, 1991, 1993 and 2007 related data and uploaded it to S3 as shown below.



Step 2 : Create a folder called scripts and upload them to S3.


The '1-create-tables-move-data.sql' script will create the ontime and the ontime_parquet_snappy tables, map the data to the tables and finally move the data from the ontime table to the ontime_parquet_snappy table, transforming the data from the csv to the Parquet format. Below is the SQL for the same.
create external table ontime (
Year INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime INT,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Dest STRING,
Distance INT,
TaxiIn INT,
TaxiOut INT,
Cancelled INT,
CancellationCode STRING,
Diverted STRING,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
Year INT,
Month INT,
DayofMonth INT,
DayOfWeek INT,
DepTime INT,
CRSDepTime INT,
ArrTime INT,
CRSArrTime INT,
UniqueCarrier STRING,
FlightNum INT,
TailNum STRING,
ActualElapsedTime INT,
CRSElapsedTime INT,
AirTime INT,
ArrDelay INT,
DepDelay INT,
Origin STRING,
Dest STRING,
Distance INT,
TaxiIn INT,
TaxiOut INT,
Cancelled INT,
CancellationCode STRING,
Diverted STRING,
CarrierDelay INT,
WeatherDelay INT,
NASDelay INT,
SecurityDelay INT,
LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script will run the query on the ontime table which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script will run the query on the ontime_parquet_snappy table which maps to the Parquet-Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.


Step 4 : Here select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1. Make sure these are selected; the rest are optional. Here we can also add a step. According to the AWS documentation, this is the definition of a Step: 'Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.' This can be an MR program, a Hive query, a Pig script or something else. The steps can be added here or later. We will add the steps later. Click on Next.


Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them as default and click on next.


Step 6 : In this step, we can select additional settings like the cluster name, the S3 log path location and so on. Make sure the 'S3 folder' points to the log folder in S3 and click on Next. Uncheck the 'Termination protection' option.


Step 7 : In this screen again all the default options are good enough. If we want to ssh into the EC2 instances then the 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally click on 'Create cluster' to launch the cluster.


Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.



Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.


Step 8 : Add a Step to the cluster by clicking on 'Add step', pointing to the '1-create-tables-move-data.sql' file as shown below and clicking on Add. The processing will start on the cluster.
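The same step can also be added from the AWS CLI along the below lines (a sketch; the cluster-id is a placeholder and the script path assumes the scripts folder was uploaded to the airline-dataset bucket):
aws emr add-steps --cluster-id <cluster-id> --steps Type=HIVE,Name="Create tables and convert to Parquet",ActionOnFailure=CONTINUE,Args=[-f,s3://airline-dataset/scripts/1-create-tables-move-data.sql]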



The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.



Once the processing has been completed, the csv data will have been converted into the Parquet format with Snappy compression and put into S3 as shown below.


Note that the csv data was close to 2,192 MB while the Parquet Snappy data is around 190 MB. The Parquet data is in a columnar format and provides higher compression compared to the csv format. This enables more data to fit into memory and so gives quicker results.

Step 9 : Now add 2 more steps using the '2-run-queries-csv.sql' and the '3-run-queries-parquet.sql'. The first sql file will run the query on the csv data table and the second will run the query on the Parquet Snappy table. Both the queries are the same, returning the same results in S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data when compared to the csv data. I was expecting the query to run even faster; I need to look into this a bit more.

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.


Note that the EMR cluster will be terminated and EC2 instances will also terminate.



Step 12 : Go back to the S3 management console; the below folders should be there. Clean up by deleting the bucket. I would be keeping the data, so that I can try Athena and Redshift on the csv and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier here.


In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster, as it follows the serverless model; AWS Athena will automatically provision the servers. We simply create a table, map it to the data in S3 and run the SQL queries on it.

With EMR the pricing is rounded up to the hour, so for a query that executes for about 1 hour and 5 minutes, we need to pay for a complete 2 hours. With Athena we pay by the amount of data scanned. So, changing the data into a columnar format and compressing it will not only make the query run a bit faster, but also cut down the bill.

Update : Here and here are articles from the AWS documentation on the same. They have some additional commands.

Update : Here is an approach using Spark/Python on EMR for converting the row format to Columnar format.
 

October 10, 2017


Revolution Analytics

R's remarkable growth

Python has been getting some attention recently for its impressive growth in usage. Since both R and Python are used for data science, I sometimes get asked if R is falling by the wayside, or if R...

...

Forrester Blogs

Emotion and the B2B brand experience

If I asked you for a list brands that might air a commercial during the Super Bowl, you’d probably include a brand like Pepsi, perhaps a Tom Brady-endorsed Foot Locker, or yet another mega consumer...

...

Forrester Blogs

Mobile-First Is Not Enough

Over the past few years, “mobile first” has become the new marketing imperative. However, few B2C marketers are good at executing this concept. More importantly, focusing solely on mobile-first...

...
 

October 09, 2017


Forrester Blogs

Check Out: Forrester Wave Data Resiliency Solutions

I am pleased to announce that the new Forrester Wave™: Data Resiliency Solutions, Q3 2017 for infrastructure and operations professionals is now live! This Wave evaluation uncovered a market in which...

...
Cloud Avenue Hadoop Tips

Automating EBS Snapshot creation

In one of the previous blogs, we looked at attaching an EBS Volume to a Linux EC2 instance. You can think of an EBS Volume as a hard disk, multiple of which can be attached to a Linux EC2 instance for storing data.
The EBS Volumes where the data is stored can get corrupted or there can be some sort of failure. So, it's better to take a Snapshot (backup) of the Volume at regular intervals depending upon the requirement. In case of any failure, the Volume can be recreated from the Snapshot as shown below. Here is the AWS documentation on how to create a Snapshot and here is the documentation for restoring the corrupted/failed Volume from a healthy Snapshot. Here is good documentation on what a Snapshot is all about.
(Diagram : EC2/EBS Volume and Snapshot life cycle)
Snapshots can be created manually using the CLI or the API, but it would be good to automate their creation. There are multiple approaches for this, which we will look into below.

Approach 1 : This is the easiest approach of all and involves no coding, as mentioned here. It uses CloudWatch Events and simply gives the option to take Snapshots at regular intervals using a cron expression; beyond that it doesn't offer much flexibility. Let's say that, out of all the EC2 instances, we want to take Snapshots only for those which have been tagged for backup; this approach would not be good enough for that.

Approach 2 : This approach is a bit more flexible, but we need to write code and be familiar with the AWS Lambda service. Java, Python, C# and Node.js can be used with Lambda as of now. These articles (1, 2) give details on creating a Lambda function using Python and triggering it at regular intervals to create a Snapshot of the Volume. The articles are a bit old, but the procedure is more or less the same. They mention scheduling from the Lambda Management Console, but the scheduling now has to be done from CloudWatch Events as mentioned here.
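For Approach 2, here is a minimal sketch of what such a Lambda function could look like, assuming the Volumes to be backed up carry a Backup=true tag; the tag name, region and description are my own placeholders, not taken from the referenced articles.

import boto3
from datetime import datetime

# The region is an assumption; use the region where the Volumes live
ec2 = boto3.client('ec2', region_name='us-east-1')

def lambda_handler(event, context):
    # Find only the Volumes that have been tagged for backup (hypothetical tag)
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'tag:Backup', 'Values': ['true']}]
    )['Volumes']

    for volume in volumes:
        # Create a Snapshot for each tagged Volume with a timestamped description
        ec2.create_snapshot(
            VolumeId=volume['VolumeId'],
            Description='Automated snapshot ' + datetime.utcnow().isoformat()
        )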

Simplified Analytics

Are you drowning in Data Lake?

Today more than ever, every business is focusing on collecting the data and applying analytics to be competitive. Big Data Analytics has passed the hype stage and has become the essential part of...

...

Simplified Analytics

Machine Learning - The brain of Digital Transformation

We are all familiar with machine learning in our everyday lives. Both Amazon and Netflix use machine learning to learn our preferences and provide a better shopping and movie experience. Artificial...

...

Simplified Analytics

How do you measure the success of Digital Transformation?

Digitization is disrupting every business and is spreading like a wildfire across every sector such as Banking, Financial Services, Insurance, Retail, and Manufacturing. Digital Transformation does...

...

Simplified Analytics

How machine learning APIs are impacting businesses?

In this Digital age, every organization is trying to apply machine learning and artificial intelligence to their internal and external data to get actionable insights which will help them to be...

...

Simplified Analytics

How to build a successful Digital Transformation roadmap?

Digital Transformation is a phenomenon that every company must deal with and it is a reality. It is a top priority for boardroom executives. Most companies know that digital transformation is vital...

...
 

October 08, 2017

Principa

The Story of Fashion Retail Credit in South Africa


Most industries owe their levels of sophistication to the visionaries in their space.  The South African credit industry is no different.  Whilst the bureaux and the banks have played a significant role in developing the South African credit landscape, arguably the fashion retailers have also played a pioneering role in revolving credit. And so our vibrant industry owes much to the role of the fashion retailers.  But how did it all begin?


Simplified Analytics

How Robotic Process Automation helping Digital Age

Digital has brought in so many technological advances to this age and one of them is Robotic Process Automation (RPA). A simple definition of RPA is, automation of business processes across the...

...
 

October 06, 2017


Forrester Blogs

Forrester.com Experienced A Cybersecurity Incident

Today, we announced that Forrester.com experienced a cybersecurity incident this week. To date, our investigation has determined that the attack was limited to research reports made available to...

...

Revolution Analytics

Because it's Friday: Death Risk

Humans are notoriously bad at understanding risk, and perceptions of danger can vary wildly depending on how the possibility is presented. (David Spiegelhalter recently published an excellent review...

...

Forrester Blogs

Video Platform Convergence Will Benefit Your Customers And Employees

Video is about 1,800 times more powerful than words. It’s the medium that creates the best emotional connection with both your customers and your employees. With the convergence of internal and...

...

Forrester Blogs

Join Our 2017 Global Mobile Executive Survey

For the past few years, Forrester has fielded a Global Mobile Executive Survey to understand and benchmark mobile initiatives. This year, we are updating the survey again to help marketers and...

...
Ronald van Loon

RELX Group: The Transformation to a Leading Global Information & Analytics Company

When we talk about taking the changes in technology and implementing them within an organisation, one name jumps to mind – RELX Group. The transformation of the FTSE 100 (and FTSE 15) RELX Group from a media company to leading global information and analytics company, with a market capitalization of about $44bn, is indeed inspirational and somewhat surprising.

With a heritage in publishing, RELX Group has now successfully transformed its revenue streams. Over the past decade, print sales have been managed down from 50% to just 10% and the vast majority of revenues are now derived from digital. The company spends $1.3bn on technology annually and employs 7,000 technologists across the company’s four global divisions. Notably, MSCI re-categorized RELX as a business services company rather than a media group last year.

I recently had the pleasure of interviewing Kumsal Bayazit, Chairwoman of the RELX Technology Forum at RELX Group. Ms Bayazit has been with the group for more than 14 years and played a major role in shaping the path of the company's transformation over the last decade.

The Transformation

Every transformation within an organization requires firm belief and perseverance. Without either of these, the transformation will either stall or push the organization backwards. The successful transformation of RELX was no different. When asked about the vision behind the transformation, Bayazit responded:

“We created the transformation through our strategy, which focuses on providing our customers with electronic decision tools to help them make better decisions, get better results, and help them be more productive.”

She then explained that RELX's rich heritage of comprehensive data sets helped smooth the transformation process, and that over the last two decades the company had complemented those rich assets by building an equally deep capability in analytics and technology.

Focusing on the transformation, Bayazit had a lot to say about what initiated the transformation for RELX Group.

“It started with a couple of things. First, we could see with the impact of technology that the jobs of our customers were changing, whether those were lawyers, authors, scientists or risk professionals. As the Internet evolved, digital evolved, technology evolved, the way they wanted to interact with information and content – changed. And because we have a deep relationship with our customers, we could see their needs were changing and then the logical question was: How do we better serve them?”

She outlined the need to move faster, better and more cost effectively. Recognizing the changing trends of the market, RELX Group started sub-segmenting customer groups. For example, RELX's LexisNexis legal business kept deepening its customer segmentation to identify specific categories, from litigators to malpractice litigators to medical malpractice litigators and so on. Once segmented, the LexisNexis teams focused on the precise needs of each specific group and discovered that the way customers were interacting with content and data was changing.

The changing customer interaction with data prompted RELX to review its strategy and formulate a plan for customers on a micro level. Thus, RELX could now understand what the different segments wanted from them and how they could adapt to the changing needs and present the customer segments with just what they wanted.

The strategy for RELX is to try and adapt. Bayazit said:

“It is not an easy journey. You have to test and try and adjust, test and try and adjust. That is how we built our strategy and we continue to refine it. If something does not work, we go back and change it.”

The Four Pillars

The four primary capabilities that RELX Group focuses on are:

  1. Deep understanding of customers
  2. Leading content and data sets
  3. Sophisticated analytics
  4. Powerful technology

Plenty of companies will be able to do any one of these four things at one time, but RELX Group prides itself on the fact that it can manage and handle all four of these pillars at the same time. Deep understanding of customers has always been in the DNA of the group. This is the way it has operated and plans to operate in the future.

According to Bayazit, RELX Group is currently well organized when it comes to customer segmentation. Sub-sets of customers are catered to by an individual group of commercial product and technology teams that understand the dynamics of the specific group and offer them new innovations accordingly. The teams also reflect the passions of the professions they work for. So, let’s say if RELX Group’s Elsevier business has a team that is working with science professionals, the team members would have plenty of passion and expertise in those areas. The product groups at RELX Group also live and breathe the changing trends and requirements across each of these sub-segments. They are in continuous coordination with the technology group, so all the needs are communicated appropriately and promptly, further helping the company remain organized and streamlined for the future. Bayazit summed up the structure and how it combines to assist the completion of the four pillars.

“To sum up, we have a product team and their job is to understand customer segment needs and then translate them into an offering that meets their needs. These people are more on the business side, so they are commercially oriented, but they are relatively tech savvy as well. And that is important because you really want your technologists and business people to be able to communicate smoothly. Then we have the technology team that has all the engineering folks, coders, architects, and data scientists. Architects will look into all the evolving technologies like machine learning and understand what the capabilities of these technologies are. Then they choose which of those emerging technologies would be most effective if deployed to tie in with the needs and offering that the product team has worked on.”

The Role of Data and Analytics

The role of data and analytics is highlighted in RELX Group's second pillar of capabilities. The content and data sets on which the transformational strategy rests are based on data acquired over many years, and they are incredibly comprehensive and unique. RELX Group has petabytes of data under its control, according to Bayazit. The company builds data, acquires it, licenses it, aggregates it, and also builds contributed databases to which its customers can contribute.

Kumsal Bayazit recognizes the plethora of data the company controls, referencing some of the figures that illustrate the data and analytical power of RELX Group.

“These raw data sets are then put together and refined to help businesses with their decision making. Just to give you a sense of data – RELX Group’s LexisNexis Risk Solutions business has 6 petabytes of data including 65 billion items of information collected from over 10,000 sources and 30 million business entities. In the insurance sector, we have over 2 billion miles of driving behavior data. LexisNexis legal business has 3 petabytes of data available. Elsevier has the most expansive citation and abstract database with over 65 million records. Science Direct, which is one of our primary data and content distribution platforms accessible to scientists, has over 14 million content pieces alone. So, that in itself is a rich data set for us to mine.”

The interview was a great learning experience for me, as I got to hear one of the most remarkable transformation stories of this decade. You can learn more about the transformation at RELX by visiting their site www.relx.com.

About the Author

Ronald van Loon is an Advisory Board Member and Big Data & Analytics course advisor for Simplilearn. He contributes his expertise towards the rapid growth of Simplilearn’s popular Big Data & Analytics category.

If you would like to read more from Ronald van Loon on the possibilities of Big Data and Artificial Intelligence please click “Follow” and connect on LinkedIn and Twitter.

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker, chairman and panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at a data consultancy company that is a leader in Big Data & data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post RELX Group: The Transformation to a Leading Global Information & Analytics Company appeared first on Ronald van Loons.


Forrester Blogs

A New Age For Smart Cities: Q&A

Last week I presented a webinar “A New Age for Smart Cities” as part of Philips’ Lighting University. Many cities have invested in new lighting initiatives replacing traditional lights with LED...

...
 

October 05, 2017


Forrester Blogs

Watch Out Translators, Google Wants Your Job!

Google takes a second shot at disrupting healthcare (and Apple) with its Health 2.0 announcement. Google made several announcements Wednesday around new hardware, software, and developer tools. Capitalizing on...

...

Forrester Blogs

Q&A: Digital Experience Platform Wave 2017

Two weeks ago my colleague Ted Schadler and I published our Wave on the Digital Experience Platform market (link) and an accompanying trends report (link). We’re grateful to all those** who...

...

Forrester Blogs

OTT Won’t Replace Traditional TV — The Two Will Get Married And Have Babies

Recent headlines have covered cord cutting as if it is an inevitable trend, set to doom the TV industry as we know it. However, when we peel back the layers of the alarmist headlines, the argument...

...

Forrester Blogs

In A Brave New World Of Viewing, Consumers Need A New TV Guide

Streaming behavior continues to increase among consumers, and TV viewing – whether it is through live TV, DVRs, or OTT devices – is a critical part of the viewing mix. With the recent surge in...

...

Revolution Analytics

In case you missed it: September 2017 roundup

In case you missed them, here are some articles from September of particular interest to R users. The mathpix package converts images of hand-drawn equations to their LaTeX equivalent. R 3.4.2 is...

...
Silicon Valley Data Science

Data Opportunities in Health Care

No one group in the health care industry can create a complete patient picture. Rather, insurance companies, drug companies, doctor offices, etc., each get snapshots of the patient at fixed points in time. Having one place in which to compile all patient data would be tremendously useful, but the lack of it shouldn’t stop you from using the data you do have.

While the specifics may look a bit different from company to company, this post contains high-level observations on how your health care organization can start with what you know about patients to build better, stronger patient relationships. For further thoughts on understanding customer engagement in product companies, see our Data-Driven User Engagement post.

Becoming patient-centric through data

It’s tempting to believe that shipping your data off to a firm for model-building, or uploading it to a product, will expose amazing insights. To truly be successful, though, you need to be intentional about the insights you’re after, and start by identifying your business objectives. With those in hand, think about the data you actually need to achieve those objectives.

How can you get from the data you have, to the data you need? The following observations will give you a place to start so you can use your data to drive patient engagement.

Establish patient engagement metrics

Before trying to “boil the ocean” and unlock all aspects of the patient experience, determine which specific results you most want to improve. Questions may include:

  • Who is the “patient” or consumer of your health care offering?
  • What is “success,” and how will you measure it?
  • How does your role in the continuum of care position you to achieve those results?

Once you have addressed questions like these, you need to understand whether the data supports your definitions and metrics so you can quantify outcomes. Performing Exploratory Data Analysis (EDA) can help you evaluate the feasibility of your targets and develop an appreciation for the data you have, as well as requirements for what you may need. EDA brings together samples of your patient-related data and uses visual and quantitative methods to understand and summarize a dataset, without making any assumptions about its contents.
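As an illustration of that kind of EDA pass, here is a minimal sketch in Python with pandas; the file and column names are invented for the example and will differ for any real data set.

import pandas as pd

# Hypothetical extract of patient engagement data
patients = pd.read_csv('patient_engagement.csv')

# Summary statistics and missing-value profile, without assuming anything about the contents
print(patients.describe(include='all'))
print(patients.isnull().mean().sort_values(ascending=False))

# A first look at how an outcome metric varies across a candidate segment column
print(patients.groupby('plan_type')['visits_last_year'].agg(['count', 'mean', 'median']))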

Understand patient segments

If you're having trouble answering that first question (who is the "patient" or consumer of your health care offering?), you can use your data to perform segmentation and define different groups of patients with different sets of needs. This will allow you to create a more personalized experience for each segment of your patients. Here are some specific examples of things to try, with a clustering sketch after the list:

  • apply historical models and insights to new patients in order to quickly put them into segments to better manage their experience
  • test and iterate on programs targeted to specific segments
  • create opportunities for personalized health care, based on unique patient and segment characteristics
  • understand patient migration between segments, including leading indicators and remediation strategies
  • map channel and engagement behaviors since some demographics may “customize” how they manage their care beyond what is assumed
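Here is a minimal clustering sketch with scikit-learn along those lines; the features and the number of segments are invented for the example, not a recommendation.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

patients = pd.read_csv('patient_engagement.csv')

# Hypothetical numeric features describing engagement and utilization
features = patients[['age', 'visits_last_year', 'rx_refill_rate', 'portal_logins']]

# Scale the features so no single one dominates the distance calculation
scaled = StandardScaler().fit_transform(features)

# Assign each patient to one of a handful of segments
patients['segment'] = KMeans(n_clusters=4, random_state=42).fit_predict(scaled)

# Profile the segments to see how they differ
print(patients.groupby('segment')[features.columns].mean())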

Engage the larger ecosystem

Each participant in the health care ecosystem is trying to optimize their services and outcomes around what they actually know; then they purchase, infer, or predict information to get at what they do not. Over the last few years, many organizations have been carefully developing data-sharing partnerships for very specific uses while keeping their competitive advantages close.

There are distinct opportunities for anyone aggregating health-related data. EDA will not only reveal any gaps in your own data, but will also help you understand the value of the data you create through your business operations. Think about how it can be used beyond its primary transaction or interaction purposes, and consider who else might find it valuable.

Next, explore who has puzzle pieces of data that would enrich your own data for better patient engagement. How might you share that data and what could you get in return?

Detect and predict patterns to drive adherence

In order to achieve the desired outcomes, a patient must adhere to their prescribed treatment regimen. Pattern detection can be used to understand the patient’s behaviors in following the desired protocols. Once you understand patterns, you can assess the challenges a patient may be encountering and potentially predict the path of their care through your offering.

Is the patient at risk of going off therapy? Is he showing signs of dealing with complications or comorbidities? Is she risking a lapse in coverage? Or are you simply seeing seasonality, such as the patient taking a break over the holidays?

Applying techniques such as machine learning to your patient-related data can help you detect these patterns and predict potential outcomes, including who or what may be influencing the patient, the degree of the influence, and what types of intervention may lead to better adherence.
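To make that concrete, here is a minimal scikit-learn sketch of such a risk model; the feature and label names are invented for the example.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

patients = pd.read_csv('patient_adherence.csv')

# Hypothetical behavioral features and a label marking patients who went off therapy
X = patients[['days_since_last_refill', 'refill_gap_mean', 'copay_amount', 'portal_logins']]
y = patients['went_off_therapy']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a simple classifier and check how well it separates at-risk patients
model = GradientBoostingClassifier().fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Rank current patients by predicted risk so outreach can be prioritized
patients['off_therapy_risk'] = model.predict_proba(X)[:, 1]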

What next?

To truly make the most of your patient data, you need to first establish your business goals. Once those are in place, and agreed upon among your various stakeholders, look at the data you have available. Map that data to those business goals and identify any gaps that might exist. From there, you can prioritize, knowing that you are being deliberate in your choices.

We know what it takes to become data-driven, and have experience in the health care industry. Contact us to talk more about how we can help you take advantage of your patient data.

The post Data Opportunities in Health Care appeared first on Silicon Valley Data Science.


Forrester Blogs

Google Signals Strong Commitment To Consumer Experiences With New Devices Launched

Google has dabbled in hardware on and off for years. Yesterday’s announcements – new smartphones, headphones, wireless speakers and more – signal a strong commitment to being a key player in...

...

Forrester Blogs

Human + Machine: The Robot Revolution Requires Human Design

Robots – in the form of artificial intelligence, software bots, intelligent assistants, customer self-service solutions, and, yes, physical robots – are transforming our economy, our jobs, and how we...

...

Forrester Blogs

CSI: Your Network – Reconstructing the Breach

September 2017 was a busy month. Three major breach notifications in Deloitte, the SEC, and Equifax… and my first Wave dropped, coincidentally on Digital Forensics & Incident Response...

...
 

October 04, 2017


Revolution Analytics

Introducing the Deep Learning Virtual Machine on Azure

A new member has just joined the family of Data Science Virtual Machines on Azure: The Deep Learning Virtual Machine. Like other DSVMs in the family, the Deep Learning VM is a pre-configured...

...

Forrester Blogs

The US Health Insurance Customer Experience Index, 2017

Health Insurers’ CX Leaves Room For Improvement, For Customers And Revenue Today, only two things predict whether someone will buy your insurance or your competitor’s: 1) is my doctor covered?...

...

Forrester Blogs

Office Depot Acquires CompuCom — Is The Third Time The Charm For Retail Diving Into IT Services?

As Yogi Berra said, “It’s déjà vu all over again.” Office Depot announced the $1 billion acquisition of CompuCom yesterday, which is a strong pivot into IT services for the 31-year-old...

...

Revolution Analytics

A cRossword about R

The members of the R Ladies DC user group put together an R-themed crossword for a recent networking event. It's a fun way to test out your R knowledge. (Click to enlarge, or download a printable...

...
 

October 03, 2017


Revolution Analytics

Create Powerpoint presentations from R with the OfficeR package

For many of us data scientists, whatever the tools we use to conduct research or perform an analysis, our superiors are going to want the results as a Microsoft Office document. Most likely it's a...

...