Within the CS-AWARE project, 3rdPlace is committed to implementing a data collection framework capable of storing, in real time, a large stream of logs coming from systems, applications, databases and network devices.

After an initial analysis phase carried out during the first months of the project, several alternative solutions have been identified. This post briefly describes one of the most interesting of them, based on BigQuery and Apache Beam.

Apache Beam is an open source solution specifically designed to model parallel processing pipelines for both batch and streaming data. Using the Beam SDK, it is possible to define a streaming pipeline that can be executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Beam is particularly useful for tackling data processing tasks in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel.
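
To make the pipeline idea more concrete, here is a minimal sketch using the Beam Python SDK. It is only an illustration, not the project's actual pipeline: the log line layout (timestamp, host, level, message separated by spaces), the function names and the filtering rule are all assumptions made for this example. With no options given, the pipeline runs locally on Beam's default direct runner.

```python
from typing import Iterable, Optional


def parse_log_line(line: str) -> Optional[dict]:
    """Parse an assumed 'TIMESTAMP HOST LEVEL MESSAGE' line; return None if malformed."""
    parts = line.strip().split(" ", 3)
    if len(parts) != 4:
        return None
    timestamp, host, level, message = parts
    return {
        "timestamp": timestamp,
        "host": host,
        "level": level.upper(),
        "message": message,
    }


def run(lines: Iterable[str]) -> None:
    """Build and run a small pipeline that prints the ERROR records found in `lines`."""
    # Imported here so the parsing helper stays usable without Beam installed.
    import apache_beam as beam

    # No pipeline options are passed, so Beam falls back to its local direct runner.
    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "CreateLines" >> beam.Create(list(lines))
            | "Parse" >> beam.Map(parse_log_line)
            | "DropMalformed" >> beam.Filter(lambda record: record is not None)
            | "KeepErrors" >> beam.Filter(lambda record: record["level"] == "ERROR")
            | "Print" >> beam.Map(print)
        )
```

Each element is parsed and filtered independently of the others, which is what allows a runner to split the input into bundles and process them in parallel. Switching to another back-end is a matter of pipeline options rather than of rewriting these steps.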

If Google Cloud Dataflow is employed as the distributed processing back-end, then the pre-processed data can be stored in BigQuery, a high-performance data warehouse for big data. BigQuery's advantages include being highly scalable, low cost and serverless, and it supports an SQL-like syntax. Furthermore, BigQuery offers a high-speed streaming insertion API, with various plugins available for real-time analytics.
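
As a rough sketch of what streaming insertion and SQL-style querying could look like from Python, the snippet below uses the `google-cloud-bigquery` client library. The table identifier, the four-column layout (timestamp, host, level, message) and the helper names are hypothetical and chosen only for illustration; actually sending rows requires a Google Cloud project and credentials.

```python
from datetime import datetime, timezone

# Hypothetical table identifier: project, dataset and table names are placeholders.
TABLE_ID = "my-project.logs.events"


def to_row(host: str, level: str, message: str, when: datetime) -> dict:
    """Build a JSON-serializable row for the assumed (timestamp, host, level, message) layout."""
    return {
        "timestamp": when.astimezone(timezone.utc).isoformat(),
        "host": host,
        "level": level.upper(),
        "message": message,
    }


def stream_rows(rows: list, table_id: str = TABLE_ID) -> list:
    """Send rows through the streaming insert API; the returned list is empty on success."""
    # Imported here because it needs the google-cloud-bigquery package and credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    return client.insert_rows_json(table_id, rows)


# Example query in BigQuery's SQL dialect: number of ERROR records per host.
ERRORS_PER_HOST_SQL = f"""
SELECT host, COUNT(*) AS errors
FROM `{TABLE_ID}`
WHERE level = 'ERROR'
GROUP BY host
ORDER BY errors DESC
"""
```

Inside a Beam pipeline the same result would more likely be obtained with Beam's own BigQuery sink; the direct client call is used here only to keep the example short.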

To conclude, so far this solution looks promising in terms of both cost and performance; additionally, it does not appear too onerous to implement.

Matteo Bregonzio