What is Data Stream in Data Mining

February 5, 2018 Author: virendra

In today’s information society, people gather and share data anytime and anywhere through applications such as social networks, banking, telecommunication, health care, research, and entertainment. As a result, a huge amount of data about human activity is collected for storage and processing. These data sets may contain useful knowledge in the form of hidden patterns, but their volume makes it impossible to extract that knowledge manually. Data streaming also requires sufficient bandwidth and, where people perceive the data in real time, enough data arriving continuously that there is no noticeable time lag.

What is it?

Streaming data is data that is generated continuously by thousands of data sources, which typically send their data records simultaneously and in small sizes (on the order of kilobytes). Streaming data includes a wide variety of data, such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers. This data needs to be processed sequentially and incrementally, either record by record or over sliding time windows, and is used for a wide variety of analytics including correlations, aggregations, filtering, and sampling. Information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering and billing), server activity, website clicks, and the geolocation of devices, people, and physical goods, and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment about their brands and products by continuously analyzing social media streams, and respond in a timely fashion when necessary.
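To make the idea of incremental processing over a sliding time window concrete, here is a minimal Python sketch. It assumes each record is a `(timestamp, value)` pair and that records arrive in timestamp order; the function name `sliding_window_average` and the sample data are illustrative only and do not come from any particular streaming product.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Assumed record shape: (timestamp in seconds, numeric value).
Record = Tuple[float, float]


def sliding_window_average(
    records: Iterable[Record], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """For each arriving record, yield (timestamp, mean of the values
    whose timestamps fall within the last `window_seconds`)."""
    window: Deque[Record] = deque()
    running_sum = 0.0
    for timestamp, value in records:
        # Incremental update: add only the newly arrived record.
        window.append((timestamp, value))
        running_sum += value
        # Evict records that have slid out of the time window.
        while window and window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            running_sum -= old_value
        yield timestamp, running_sum / len(window)


if __name__ == "__main__":
    clicks = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (5.0, 40.0)]
    for ts, avg in sliding_window_average(clicks, window_seconds=3.0):
        print(ts, avg)
```

Because the function keeps only the records inside the current window and a running sum, its memory use depends on the window length rather than on the total length of the stream.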



Stream Data Processing

Stream data processing refers to technology that processes stream data in real time. Stream data is data that is created continuously and carries a timestamp (the time at which the data was created or updated), such as stock trading information and traffic information.

The following figure shows the components used in stream data processing.

Figure 1:   Data Streaming Architecture

The following subsections describe the components shown in the figure:

  • Adaptors (input adaptor and output adaptor)

The input adaptor converts input data to a format that the stream data processing engine can process, and then sends the data to the engine. The output adaptor receives the data processed by the engine, converts it to a specified format, and then outputs it. There are two types of adaptors: standard adaptors, which are provided by Stream Data Platform – AF, and custom adaptors, which users write in Java. Standard adaptors function as a group called an adaptor group. For custom adaptors, an input adaptor is called a data transmission application and an output adaptor is called a data reception application.
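The two adaptor roles can be sketched in a few lines. The sketch below is in Python purely for illustration (the platform described above uses Java for custom adaptors), and the CSV input layout, the JSON output format, and the function names `input_adaptor` and `output_adaptor` are all assumptions rather than part of any real adaptor API.

```python
import json
from typing import Dict, Union

# Assumed engine-side record format: a plain dictionary.
EngineRecord = Dict[str, Union[str, float]]


def input_adaptor(raw_line: str) -> EngineRecord:
    """Convert a raw CSV line 'timestamp,symbol,price' into the
    record format the (hypothetical) engine understands."""
    timestamp, symbol, price = raw_line.strip().split(",")
    return {"timestamp": float(timestamp), "symbol": symbol, "price": float(price)}


def output_adaptor(record: EngineRecord) -> str:
    """Convert a processed record into the chosen output format (JSON here)."""
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    record = input_adaptor("1.5,ABC,10.25\n")
    print(output_adaptor(record))
```

Keeping format conversion in the adaptors means the processing logic only ever sees one record format, whatever the external sources and sinks use.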

  • Stream data processing engine

The stream data processing engine processes the data sent from an input adaptor in accordance with a pre-registered query. The processed data is then sent to an output adaptor.
A server process that runs on the stream data processing engine and processes stream data is called an SDP server. The adaptors and the SDP server can all operate in the same process, or they can operate in separate processes. In the figure above, the adaptors and the SDP server are operating in separate processes.

  • Query group

A query group defines a summary analysis scenario for stream data. In the stream data processing system, each scenario (the types of data to be input and how that data is to be processed and output) is defined as a query group.
A query group consists of the following elements:

  • Input stream queue (input stream)
  • Query
  • Output stream queue (output stream)

A query is a definition that specifies how stream data is to be processed. Stream data is sent to a query via an input stream queue. The result of query processing is converted to stream data and then output via an output stream queue. There is a one-to-one correspondence between a stream queue and a stream.
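The relationship between the three elements of a query group can be illustrated with a small Python sketch. This is not the SDP server's actual interface: the class `QueryGroup`, its methods, and the example query `high_price_query` are hypothetical names, and the query is modeled simply as a function that returns a result record or `None`.

```python
from queue import Queue
from typing import Callable, List, Optional

Record = dict
# A query maps one input record to a result record, or to None to drop it.
Query = Callable[[Record], Optional[Record]]


class QueryGroup:
    """Hypothetical query group: input stream queue -> query -> output stream queue."""

    def __init__(self, query: Query) -> None:
        self.input_stream: "Queue[Record]" = Queue()
        self.query = query  # registered before any data arrives
        self.output_stream: "Queue[Record]" = Queue()

    def send(self, record: Record) -> None:
        """Called by an input adaptor to enqueue stream data."""
        self.input_stream.put(record)

    def run_once(self) -> None:
        """Apply the query to everything currently on the input stream queue."""
        while not self.input_stream.empty():
            result = self.query(self.input_stream.get())
            if result is not None:
                self.output_stream.put(result)

    def drain(self) -> List[Record]:
        """Called by an output adaptor to collect the processed records."""
        results = []
        while not self.output_stream.empty():
            results.append(self.output_stream.get())
        return results


def high_price_query(record: Record) -> Optional[Record]:
    """Example pre-registered query: keep only trades priced above 100."""
    return record if record["price"] > 100 else None


if __name__ == "__main__":
    group = QueryGroup(high_price_query)
    for price in (50.0, 150.0, 120.0):
        group.send({"symbol": "ABC", "price": price})
    group.run_once()
    print(group.drain())
```

The sketch is single-threaded for clarity; in a deployment where adaptors and the engine run in separate processes, the queues would be replaced by some form of inter-process transport.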

Stream Data Types and Characteristics

There are two kinds of data streams, transactional data streams and measurement data streams:

Transactional data streams: These streams record interactions between entities, for example, credit card purchase details, phone call records linking callers to dialed parties, and records of client accesses to a server.

Measurement data streams: These streams consist of data from monitors (or sensors) on entities of interest and track changes in the state of those entities, for example, traffic details at router interfaces, weather data at weather stations, and road traffic in sensor networks.
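The difference between the two kinds of stream shows up in the shape of a record and in the questions typically asked of it. The following Python sketch uses assumed field names (`caller`, `callee`, `sensor_id`, and so on) and two illustrative helper functions; it is one possible modeling, not a standard schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TransactionRecord:
    """One interaction between two parties, e.g. a phone call."""
    timestamp: float
    caller: str
    callee: str
    duration_seconds: float


@dataclass(frozen=True)
class MeasurementRecord:
    """One reading of an entity's state, e.g. traffic at a router interface."""
    timestamp: float
    sensor_id: str
    value: float


def calls_per_caller(stream: Iterable[TransactionRecord]) -> Dict[str, int]:
    """Transactional view: count interactions per originating party."""
    counts: Dict[str, int] = {}
    for record in stream:
        counts[record.caller] = counts.get(record.caller, 0) + 1
    return counts


def latest_state(stream: Iterable[MeasurementRecord]) -> Dict[str, float]:
    """Measurement view: keep the most recent reading per monitored entity."""
    state: Dict[str, float] = {}
    for record in stream:
        state[record.sensor_id] = record.value  # later readings overwrite earlier ones
    return state
```

Both helpers consume their stream one record at a time, so they work equally well on a finite list or on a generator that never ends.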



The main data stream characteristics and their implications are:

  • Data items arrive continuously and sequentially as a stream and are usually ordered by a timestamp value or other attribute values of the data item. Therefore, data items belonging to the same data stream are usually processed in the order they arrive.
  • Data streams are usually generated by external sources or other applications and are sent to a Data Stream Management System (DSMS); typically, a DSMS does not have direct access to, or control over, the data sources.
  • The input characteristics of a data stream are usually not controllable and they are typically unpredictable.
  • Data items in a data stream are not error-free: because the data sources are external, some data items may be corrupted or discarded due to network or transmission problems.
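Since items can arrive corrupted, a stream consumer usually validates each item as it arrives rather than trusting the source. The Python sketch below assumes a simple `timestamp,value` text format; the names `parse_reading` and `clean_stream` are illustrative. It skips malformed lines and passes well-formed ones through in arrival order.

```python
from typing import Iterable, Iterator, Optional, Tuple

Reading = Tuple[float, float]  # assumed shape: (timestamp, value)


def parse_reading(raw: str) -> Optional[Reading]:
    """Return (timestamp, value), or None if the line is malformed."""
    parts = raw.strip().split(",")
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None


def clean_stream(raw_lines: Iterable[str]) -> Iterator[Reading]:
    """Yield only well-formed readings, preserving arrival order."""
    for raw in raw_lines:
        reading = parse_reading(raw)
        if reading is not None:
            yield reading


if __name__ == "__main__":
    incoming = ["1,2.5", "garbage", "2,,", "3,4"]
    print(list(clean_stream(incoming)))
```

Dropping bad items silently is only one policy; depending on the application, a consumer might instead count them, log them, or route them to a separate stream for inspection.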

