Machine Learning Forecasting with Elasticsearch, Elastic Stack (ELK Stack)


>>SHAY BANON: The company has built an amazing anomaly detection engine, and we’ve taken that and integrated it into our stack as a feature. And we’ve called this team, and we have called this feature, machine learning, even though machine learning is such a huge topic. The idea is to convey to you, our users, that we want to go way beyond just anomaly detection. Our goal is to slowly start to address other aspects of the machine learning space, which is admittedly extremely huge.

And one of the things that we’ve done, and I would say it is one of the obvious ones, is this: once we can look back in time and detect anomalies, then, based on the models we’ve built as a result, maybe we can use those models to look into the future and try to forecast what is going on in a system. For example, is a disk going to run out of space in one of your data centers? Is a certain KPI going to start to trend up and to the right, or down and to the left? To see what we’ve done over the past year, I would like to welcome Steve Dodson on stage. Steve?

>>STEVE DODSON: Thank you. I’m Steve, the tech lead of the machine learning group, and I would like to do a quick demo of the machine learning features in the Elastic Stack. To start with, the data I’m going to analyze is the New York taxi data. As you can see, each record here represents a taxi trip, so it has the pickup and drop-off location, the total distance, the cost, and so on.
What we would like to do is analyze this data and use the machine learning capabilities to get some insights out of it. So if I go to the machine learning tab and click on it, I can create a job. What we’re doing when we create a job is simply taking data that is stored in Elasticsearch and pushing it through the machine learning components in the X-Pack plugin; that creates models and identifies anomalies, and I’ll also show some examples of forecasting using the same technology later on.

So I will start with the New York data, and I’m going to initially perform a very simple job here, which is going to look at the overall rate of pickups. And I’m going to put this in a particular job group; I will explain why I’m doing that in a minute.
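For reference, a minimal sketch of what creating such a job might look like against the 6.x REST API, assuming a local cluster and hypothetical index and field names ("nyc-taxi", "pickup_datetime"), with an assumed bucket span:

```python
import requests

ES = "http://localhost:9200"  # assumed local 6.x cluster with X-Pack ML

# Hypothetical job: model the overall rate of taxi pickups with a
# simple `count` detector. Index name, time field, and bucket span
# are assumptions for illustration.
job = {
    "description": "Overall rate of NYC taxi pickups",
    "groups": ["taxi"],  # the job group mentioned in the demo
    "analysis_config": {
        "bucket_span": "1h",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "pickup_datetime"},
}
requests.put(f"{ES}/_xpack/ml/anomaly_detectors/taxi-pickup-rate", json=job)

# A datafeed streams documents from the source index into the job.
datafeed = {"job_id": "taxi-pickup-rate", "indices": ["nyc-taxi"]}
requests.put(f"{ES}/_xpack/ml/datafeeds/datafeed-taxi-pickup-rate", json=datafeed)
```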
So, effectively, what we have here is the data analyzed. And back here, we see that there’s an anomaly in the data; this anomaly is actually on the 14th of March. If you look at what happened in New York on the 14th of March, there was a low volume of traffic because there was a lot of snow on the roads, so there were a lot fewer taxi pickups.

There is another feature I was going to demo, and it is why I used job groups: scheduled events, which allow us to treat certain events differently. Here, for example, there’s a daylight saving time change, and we can treat that differently from the rest of the data; that’s something we will demo on Thursday when we have a bigger session.
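As a rough sketch of the feature being described: in 6.2+, scheduled events hang off calendars, and a calendar can be attached to a job group so that every job in the group inherits its events. The calendar id and the timestamps below are illustrative values only:

```python
import requests

ES = "http://localhost:9200"

# Hypothetical calendar attached to the "taxi" job group; jobs in
# that group will treat the listed periods as scheduled events.
requests.put(f"{ES}/_xpack/ml/calendars/dst-changes",
             json={"job_ids": ["taxi"]})

# Scheduled events take start/end times in epoch milliseconds.
event = {
    "events": [{
        "description": "US daylight saving time change",
        "start_time": 1489302000000,  # assumed: 2017-03-12 02:00 EST
        "end_time": 1489388400000,    # assumed: one day later
    }]
}
requests.post(f"{ES}/_xpack/ml/calendars/dst-changes/events", json=event)
```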
And now, onto forecasting. So, I have done the sort of basic anomaly detection that we demoed last year, and now I would like to show how we can forecast and actually predict where these taxis are going to go in the next week, or over the next couple of weeks.

In terms of the use case here, we are going to look at duration of travel. Using the historical data, can we create a baseline of the typical times for going from midtown to JFK, look at that on a day-to-day and hourly basis, and predict typically how long the travel time will be? We are not building a route planning app here; route planning, accounting for traffic and adding weights to the roads and so on, is obviously a separate problem. But hopefully I will be able to show you that using the historical data to baseline our predictions gives us pretty good results.

So, I’m going to go to the midtown-to-airport journeys, and in this case I am going to look at the average duration. And I am going to run this not to the end of the data, but to two weeks before the end of the data; you will see why shortly. I will put this into a job group. Okay.
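A sketch of how one might hold back those last two weeks via the datafeed API: the start datafeed call accepts an explicit end time, so the held-back weeks can be used for validation later. The job and datafeed ids and the date range are assumptions:

```python
import requests

ES = "http://localhost:9200"

# The job must be open before its datafeed can start.
requests.post(f"{ES}/_xpack/ml/anomaly_detectors/taxi-trip-duration/_open")

# Run from the (assumed) start of the data, stopping two weeks before
# the (assumed) end so those weeks can be compared against a forecast.
params = {
    "start": "2016-01-01T00:00:00Z",
    "end": "2016-06-17T00:00:00Z",
}
requests.post(
    f"{ES}/_xpack/ml/datafeeds/datafeed-taxi-trip-duration/_start",
    params=params,
)
```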
And, again, I’m running this: there are 30 million records running through. We are aggregating the results, pushing them through the machine learning components, calculating the probability of the behavior based on what we’ve seen historically, and updating the models as the data is streamed through.

And when we view the results, you know, there’s a forecast button here. We added this to the product in 6.1, back in January, and what I can do is use it to forecast where this signal is likely to go in the next two weeks.
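A minimal sketch of what that button effectively does: it calls the forecast API that shipped in 6.1, with a duration saying how far ahead to project the model. The job id here is an assumption:

```python
import requests

ES = "http://localhost:9200"

# Ask the model to project the signal two weeks into the future.
resp = requests.post(
    f"{ES}/_xpack/ml/anomaly_detectors/taxi-trip-duration/_forecast",
    json={"duration": "14d"},
)
# The response includes a forecast_id for retrieving the results later.
print(resp.json())
```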
And you will see that despite the data being reasonably sparse here, we can actually create a reasonable baseline of where the data is going to go on a daily basis. As Shay said, the main use case is really in the operations environment, where we are looking for trends: is your disk going to run out, are you running out of resources on your machines, or what capacity do I need in the next couple of weeks based on current behavior?

And the question is, how good is the prediction we’ve made there? How good is it going to be if the data keeps running through, and how can we compare it to other methods to evaluate its effectiveness? So what I can do is go back to the job, run it over just the last two weeks of data, and then compare the forecast to those two weeks of actual data. Okay.
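One way to line the forecast up against the held-back data is to query the forecast documents out of the ML results index; they are stored with a result_type of "model_forecast". The job id is an assumption, and in practice you would also filter on the forecast_id returned earlier:

```python
import requests

ES = "http://localhost:9200"

# Fetch the forecast points for the job, sorted by time, so they can
# be compared bucket by bucket against the actual data.
query = {
    "size": 1000,
    "query": {
        "bool": {
            "filter": [
                {"term": {"result_type": "model_forecast"}},
                {"term": {"job_id": "taxi-trip-duration"}},
            ]
        }
    },
    "sort": [{"timestamp": "asc"}],
}
resp = requests.post(f"{ES}/.ml-anomalies-*/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    src = hit["_source"]
    print(src["timestamp"], src["forecast_prediction"],
          src["forecast_lower"], src["forecast_upper"])
```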
Here is the forecast, the green line. And I’m color-blind, so it is difficult for me to tell. So, the one that is not blue (laughter), the smoother one. And the green one, sorry, the blue one, well (laughter), hopefully you get the idea. (Laughter)

But the prediction looks reasonable there, and we will be talking about this more; there’s a special math session on what we’re doing behind the scenes. You can see that the variance of our bounds grows as our predictions become less certain moving forward in time.
So the other thing we can do is compare this to things like Google Maps and Bing, because they have specific route planning capabilities, and we actually ran our data against Google Maps and Bing here as well. You can sort of overlay the whole lot together, and it actually looks pretty reasonable. Obviously, in terms of doing this in real time, with real events and traffic situations, this is not necessarily the best route planning tool. But you can see that our predictions are pretty much in line with the actual data and, in particular, with Google’s forecasts, which seem pretty good.
So, when we were going through this, we were thinking, okay, what use cases can we apply this to? What New York taxi scenarios can we think about? And I think one of the classic New York taxi journeys is in Die Hard III, Die Hard with a Vengeance, when Bruce Willis and Samuel L. Jackson have to get from 72nd and Main, sorry, 72nd and Broadway, to Wall Street, and they are set a time limit by Jeremy Irons, who’s playing Simon Gruber, the evil arch-villain. He sets a time of 30 minutes. And the question is, is that a reasonable time? If you were the arch-villain, would you set that time? Would I set 20, 30, or 40 minutes; what is reasonable? (Laughter)

The good thing is, with X-Pack and ML, you can work that out! (Laughter)
So, what we are doing here is a very similar analysis. I will go back to the jobs; I pre-ran this, because this is a short demo. And we can see here our predictions for that particular journey. We have the forecast as a smooth line, and overlaying this on the actual data and the Google data, here are the results.
And now, when we did this, we thought, well, the challenge is that we knew it was 9:50, because the film makes it clear that it is 9:50 in the morning when they start the journey, but we didn’t know the month and we didn’t know the day. That obviously affects things; weekends behave differently than normal days. After watching the film several times and reading the screenplay, we found out it was set in July, and there’s a very quick glimpse of a calendar clock as Samuel L. Jackson enters Wall Street: it says Tuesday.

So, Tuesday at 9:50. We have the time here; the red dot is Simon’s predicted time, and the others are predictions from Google and from ourselves. And it is actually not that far off. You can see that it is not dissimilar from a standard travel time. So Bruce Willis going across Central Park, tailgating an ambulance, was maybe not required to get there in that time. (Laughter)
So there is going to be plenty more; as Shay says, we are looking at time series and doing other analyses as well. On Thursday at 11:30, Sophie and I will be going into more detail, but it would be really great to get your feedback and experiences with the product this week.
>>SHAY BANON: Thank you very much, Steve. (Applause)
