[Developer's community]

Hard choice between pioneering and chasing a brighter future


28 November 2016

‘There is no present or future-only the past,

happening over and over again-now’

- Eugene O'Neill

I’ve been in IT for 16 years, and over that time I have observed the same story: some companies try to impose industry trends, others try to catch up, with no success. Some ‘innovations’ aren’t worth the time spent on them, while others are real breakthroughs. In this post, I’m going to talk about DevOps. There is nothing fundamentally new in DevOps per se; what we knew as plain old automation has become something more intelligent. In business terms, the most important metrics are velocity, quality and time to market; methodology is therefore an instrument, not the objective.

Before we move forward, it is worth clarifying what DevOps is. A lot has been said: DevOps is the golden middle between Development and Operations, and so on. In my opinion, it bridges the gap between technology innovation and business value. It is all about delivering technology to business units (in enterprise architecture terms) in a timely fashion and ensuring it runs without interruption. The official definition says nothing about rapid or continuous delivery/deployment (as in Agile), because DevOps is about automation and deep communication between the relevant organizational units. In fact, it doesn’t matter which methodology your team or teams practice. At some point it all comes down to releasing software and handing it over to the operations team for deployment, and at that stage getting not only fast user feedback but also operations feedback, along with seamless integration, is vital.

Embrace the future

None of the known methodologies can simply be dropped in. The worst mistake you can make is to blindly follow its practices expecting immediate results. DevOps should be adopted, and adaptation is sometimes a long, stressful and painful process.

The reason I bring up Agile in this post is that I hear too often: ‘we already use Agile and continuous delivery, isn’t that the same as DevOps?’. Very often people (even at the executive level) confuse methodologies, their meaning and their application areas. Besides, far from every organization wants to leave its comfort zone, even when the measures are well justified. Being a change agent is not a simple role, but a very important one. You should be able to properly tie together technology innovations, business value and the way value is measured. A good change agent does not argue for DevOps with ‘we want to automate something for the sake of automation’. The right approach is: ‘we want to automate the software delivery pipeline for faster time to value and easier issue mitigation’. Overall, DevOps instills a performance-oriented culture in teams, significantly improves communication, the development cycle and release velocity, and reduces recovery time and deployment failures. So, DevOps is called to answer the questions of how to:

  • Automate build and testing processes
  • Deploy and configure environments and keep them consistent
  • Improve communication with the development team
  • Monitor production to stay aligned with defined metrics and business goals

This might be a bit tricky, considering the variety of teams (or business units, in EA terms), technologies in use, security/compliance models and methodologies. The right tool set plays a key role in the process.

The process and tooling

The tools provide automation, and automation eliminates mistakes and waste. An automated release flow not only ensures consistent versions across assemblies/builds and deployments (using automated tests) but also makes the whole process more organized and predictable. Every change is traceable and reversible.

I tend to divide the tools into the following categories:

  • Source code management (TFS, Git (GitLab, GitHub, BitBucket), Mercurial, SVN)
  • CI Tools (TFS, VS Team Services, Jenkins, TeamCity, CruiseControl)
  • Monitoring and analytics (Nagios, Zenoss, Zabbix, MS SCOM, Splunk, DataDog, Scout)
  • Incident management (Jira, Service Desk+, Desk.com, Gemini)
  • Project tracking (Jira, TFS, VS Team Services, Gemini, Trello, Bitrix24, Producteev)

Generally, the process looks like the one below:

Develop -> Build (+ Unit/Integration tests) -> Deploy -> Test (User acceptance) -> Promote (Upgrade)
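For illustration only, the flow above can be sketched as a single script; the stage commands below are placeholders (my assumption), not references to any particular tool:

```shell
#!/bin/sh
# Sketch of the release flow above. Each function stands in for a real
# tool (compiler, test runner, deployment/configuration management).
# 'set -e' aborts the pipeline as soon as any stage fails, so a broken
# build never reaches the later stages.
set -e

build()   { echo "build: compiling sources and running unit/integration tests"; }
deploy()  { echo "deploy: rolling the build out to the $1 environment"; }
uat()     { echo "test: running user-acceptance tests"; }
promote() { echo "promote: upgrading production to the accepted build"; }

build
deploy staging
uat
promote
```

In a real setup each stage would be a job in one of the CI tools listed above (Jenkins, TeamCity, VS Team Services), so every run stays traceable and reversible.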


What’s next?

In companies with DevOps and the corresponding development culture in place, you can observe better work organization at both the personal and the team level. Although DevOps is treated as a methodology, it does not replace Agile (in any sense), even though it is taking over much of the enterprise process, and definitely not because some executives like to say ‘we’re agile’. Some Agile methods can be used as part of DevOps, which is becoming more popular these days in the enterprise world. The point is that both methodologies work perfectly together for the sake of faster delivery, continuous integration and user feedback, which should lead to continuous improvement, not only at the project level but also at the organization level (a methodology successfully applied in one business unit can be moved to another with minimal amendments, embracing the whole enterprise).

Do not be afraid of experiments; let the teams try, fail and learn. Do not micro-manage the team(s) but facilitate them, and let them take responsibility.

Remember: in the end, if you don’t try, you can’t fail. But is that really what you want, considering the tough competition?

SAP ABAP in Eclipse


3 February 2014
  The Best-Run Businesses Run SAP



First of all, before using Eclipse for ABAP development we have to take care of some prerequisites.

1. Check whether Java Runtime Environment (JRE) 1.6 or higher is installed: 32-bit for the 32-bit version of ABAP Development Tools, or 64-bit for the 64-bit version. You can download it (ver. 1.7) directly from the Oracle website.

2. After installing Java, you're ready to download and set up the Eclipse IDE. It doesn't matter which Eclipse package you choose (Juno or Kepler), and you can download it directly from the official website.

3. Download SAP Development Tools for Eclipse.

    - In Eclipse, choose in the menu bar Help > Install New Software...

    - In the Install dialog, enter the following update site URL in the Work with entry field:

    - https://tools.hana.ondemand.com/kepler for Eclipse Kepler (4.3)

    - https://tools.hana.ondemand.com/juno for Eclipse Juno (4.2)

    - Press Enter to display the available features.

    - Select ABAP Development Tools for SAP NetWeaver and choose Next.

    - On the next wizard page, you get an overview of the features to be installed. Choose Next.

    - Confirm the license agreements and choose Finish to start the installation.


Now we're ready to go!

Open Eclipse and create your first project. Go to menu: File -> New -> ABAP Project. A new window will appear:

Create a new connection to the SAP system (or choose from the existing ones). I won't cover the connection creation process in detail; if you don't know how to do that, please refer to the SAP documentation or ask on our forum.

Note: I've deliberately skipped SNC name creation for the connection, as I have my own local SAP installation and there is no reason to protect it.

So! I suppose you have successfully created a new project and are ready to go further. Create a new ABAP program: right-click on the project name -> New -> ABAP Program. Give it a reasonable name (according to the SAP naming convention, a user-defined module has to start with the letter Z, e.g. "ZTEST001" or whatever you like).

Key in some source code to check how it works. We'll create a small program that selects flight numbers from the database:

report ztest0001.

* Data declaration
tables sflight.
data: begin of t_report occurs 3,
        carrid   like sflight-carrid,
        connid   like sflight-connid,
        fldate   like sflight-fldate,
        seatsmax like sflight-seatsmax,
      end of t_report.

* Selection screen
select-options s_carrid for sflight-carrid.
select-options s_connid for sflight-connid.
select-options s_date   for sflight-fldate.

* Get data
select * from sflight
  where carrid in s_carrid and
        connid in s_connid and
        fldate in s_date.
  t_report-carrid   = sflight-carrid.
  t_report-connid   = sflight-connid.
  t_report-fldate   = sflight-fldate.
  t_report-seatsmax = sflight-seatsmax.
  append t_report.
endselect.

if sy-subrc ne 0.
  write 'Data not found'.
endif.

* Write data
loop at t_report.
  skip. "go to the next line
  write t_report-carrid.
  write t_report-connid.
  write t_report-fldate.
  write t_report-seatsmax.
endloop.

 We are ready to run the project. Press Ctrl+F11 and a familiar SAP window will appear. Press F8 to execute the program and look at the results window:

Hooray! You've just finished your first journey into ABAP development with Eclipse. I hope it was an enjoyable and informative read.

Next time we'll proceed with more complicated ABAP development and see how deep the rabbit hole goes :)


Please note that republishing content from this website (in any form) is only allowed with the author's permission.




13 May 2013

In a nutshell, Hadoop is an open-source project of the Apache Software Foundation that can be installed on a set of standard machines so that these machines can communicate and work together to store and process large datasets. Hadoop has become very successful in recent years thanks to its ability to effectively crunch big data. It allows companies to store all of their data in one system and perform analysis on this data that would otherwise be impossible or very expensive with traditional solutions.

Many companion tools built around Hadoop offer a wide variety of processing techniques. Integration with ancillary systems and utilities is excellent, making real-world work with Hadoop easier and more productive. These tools together form the Hadoop Ecosystem.

Visit http://hadoop.apache.org for more information about the project and detailed documentation.

Note: By a standard machine, we mean typical servers that are available from many vendors and have components that are expected to fail and be replaced on a regular basis. Because Hadoop scales nicely and provides many fault-tolerance mechanisms, you do not need to break the bank to purchase expensive top-end servers to minimize the risk of hardware failure and increase storage capacity and processing power.

Design Concepts

To solve the challenge of processing and storing large datasets, Hadoop was built according to the following core characteristics:

  • Distribution - instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.
  • Horizontal scalability - it is easy to extend a Hadoop cluster by just adding new machines. Every new machine increases total storage and processing power of the Hadoop cluster.
  • Fault-tolerance - Hadoop continues to operate even when a few hardware or software components fail to work properly.
  • Cost-optimization - Hadoop runs on standard hardware; it does not require expensive servers.
  • Programming abstraction - Hadoop takes care of all messy details related to distributed computing. Thanks to a high-level API, users can focus on implementing business logic that solves their real-world problems.
  • Data locality - don’t move large datasets to where the application is running; instead, run the application where the data already is.

Hadoop Components

Hadoop is divided into two core components:

  • HDFS - a distributed file system;
  • YARN - a cluster resource management technology.

Many execution frameworks run on top of YARN, each tuned for a specific use-case. The most important are discussed under ‘YARN Applications’ below.

Let’s take a closer look at their architecture and describe how they cooperate.

Note: YARN is the new framework that replaces the former implementation of the processing layer in Hadoop. You can find out how YARN addresses the shortcomings of the previous version on the Yahoo blog.


HDFS

HDFS is the Hadoop Distributed File System. It can be installed on commodity servers and run on as many servers as you need - HDFS easily scales to thousands of nodes and petabytes of data.

The larger an HDFS setup is, the bigger the probability that some disks, servers or network switches will fail. HDFS survives these types of failures by replicating data across multiple servers. It automatically detects that a given component has failed and takes the necessary recovery actions, which happen transparently to the user.

HDFS is designed for storing large files, on the order of hundreds of megabytes or gigabytes, and provides high-throughput streaming access to them. Last but not least, HDFS supports the write-once-read-many model. For this use case HDFS works like a charm. If, however, you need to store a large number of small files with random read-write access, then other systems like an RDBMS or Apache HBase can do a better job.

Note: HDFS does not allow you to modify a file’s content; it only supports appending data at the end of a file. However, Hadoop was designed with HDFS as one of many pluggable storage options - for example, with MapR-FS, a proprietary filesystem, files are fully read-write. Other HDFS alternatives include Amazon S3 and IBM GPFS.

Architecture of HDFS

HDFS consists of the following daemons, which are installed and run on selected cluster nodes:

  • NameNode - the master process responsible for managing the file system namespace (filenames, permissions and ownership, last modification date etc.) and controlling access to data stored in HDFS. It is the one place where there is a full overview of the distributed file system. If the NameNode is down, you can not access your data. If your namespace is permanently lost, you’ve essentially lost all of your data!
  • DataNodes - slave processes that take care of storing and serving data. A DataNode is installed on each worker node in the cluster.

Figure 1 illustrates an HDFS installation on a 4-node cluster. One of the nodes hosts the NameNode daemon while the other three run DataNode daemons.

Figure 1. HDFS on a 4-node cluster

Note: NameNode and DataNode are Java processes that run on top of a Linux distribution such as RedHat, CentOS, Ubuntu and others. They use local disks for storing HDFS data.

HDFS splits each file into a sequence of smaller, but still large, blocks (the default block size is 128 MB - bigger blocks mean fewer disk seek operations, which results in higher throughput). Each block is stored redundantly on multiple DataNodes for fault-tolerance. The block itself does not know which file it belongs to - this information is maintained only by the NameNode, which has a global picture of all directories, files and blocks in HDFS.

Figure 2 illustrates the concept of splitting files into blocks. File X is split into blocks B1 and B2, and File Y comprises only one block, B3. All blocks are replicated twice within the cluster. As mentioned, information about which blocks compose a file is kept by the NameNode, while the raw data is stored by DataNodes.

Figure 2. Concept of splitting files into blocks
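As a quick back-of-the-envelope check of the block math (the 500 MB file size here is just an assumed example):

```shell
#!/bin/sh
# A 500 MB file stored with the default 128 MB block size occupies
# ceil(500 / 128) = 4 blocks: three full blocks plus one 116 MB block.
file_mb=500
block_mb=128
blocks=$(( (file_mb + block_mb - 1) / block_mb ))   # integer ceiling division
echo "$blocks blocks"
```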

Interacting with HDFS

HDFS provides a simple POSIX-like interface to work with data. You perform file system operations using the hdfs dfs command.

To start playing with Hadoop you don’t have to go through the process of setting up a whole cluster. Hadoop can run in so-called pseudo-distributed mode on a single machine. You can download a sandbox virtual machine with all the HDFS components already installed and start using Hadoop in no time! Just follow one of these links:

  • http://www.mapr.com/products/mapr-sandbox-hadoop
  • http://hortonworks.com/products/hortonworks-sandbox/#install
  • http://www.cloudera.com/content/support/en/downloads/quickstart_vms/cdh-5-1-x1.html


The following steps illustrate typical operations that an HDFS user can perform:

1.    List the contents of the home directory

$ hdfs dfs -ls /user/adam

2.    Upload a file from the local file system to HDFS

$ hdfs dfs -put songs.txt /user/adam

3.    Read the contents of the file from HDFS

$ hdfs dfs -cat /user/adam/songs.txt

4.    Change the permission of a file

$ hdfs dfs -chmod 700 /user/adam/songs.txt

5.    Set the replication factor of a file to 4

$ hdfs dfs -setrep -w 4 /user/adam/songs.txt

6.    Check the size of the file

$ hdfs dfs -du -h /user/adam/songs.txt

7.    Move the file to the newly created subdirectory

$ hdfs dfs -mv songs.txt songs/

8.    Remove directory from HDFS

$ hdfs dfs -rm -r songs

You can type hdfs dfs without any parameters to get a full list of available commands.



YARN

YARN is a framework that manages resources on the cluster and enables running various distributed applications that process data stored (usually) on HDFS.

YARN, similarly to HDFS, follows the master-slave design, with a single ResourceManager daemon and multiple NodeManager daemons. These daemons have different responsibilities.


The ResourceManager:

  • Keeps track of live NodeManagers and the amount of available compute resources they currently have
  • Allocates available resources to applications submitted by clients
  • Monitors whether applications complete successfully

The NodeManagers:

  • Offer computational resources in the form of containers
  • Run the applications’ tasks inside these containers

YARN assigns cluster resources to various applications in the form of resource containers, which represent a combination of resource elements such as memory and CPU.

Each application that executes on a YARN cluster has its own ApplicationMaster process. This process starts when the application is scheduled on the cluster and coordinates the execution of all its tasks.

Each task runs within a container managed by the selected NodeManager. The ApplicationMaster negotiates resources (in the form of containers) with the ResourceManager. On successful negotiation, the ResourceManager delivers a container specification to the ApplicationMaster. This specification is then handed over to a NodeManager, which launches the container and executes a task within it.

Figure 3 illustrates the cooperation of YARN daemons on a 4-node cluster running two applications that spawned 7 tasks in total.

Figure 3. Cooperation of YARN daemons on 4-node cluster


Hadoop 2.0 = HDFS + YARN

HDFS and YARN daemons running on the same cluster give us a powerful platform for storing and processing large datasets.

Interestingly, DataNode and NodeManager processes are collocated on the same nodes to enable one of the biggest advantages of Hadoop, called data locality. Data locality allows us to perform computations on the machines that actually store the data, minimizing the need to send large chunks of data over the network. This technique, known as “sending computation to the data”, yields significant performance improvements when processing large data.


Figure 4. Collocating HDFS and YARN daemons on a Hadoop cluster.


YARN Applications

YARN is merely a resource manager that knows how to allocate distributed compute resources to various applications running on a Hadoop cluster. In other words, YARN itself does not provide any processing logic that can analyze data in HDFS. Hence various processing frameworks must be integrated with YARN (by providing a specific implementation of the ApplicationMaster) to run on a Hadoop cluster and process data in HDFS.

The table below provides a list and short descriptions of the most popular distributed computation frameworks that can run on a Hadoop cluster powered by YARN:


MapReduce

The most popular processing framework for Hadoop; it expresses computation as a series of map and reduce tasks. MapReduce is explained in the next section.

Apache Spark

A fast and general engine for large-scale data processing that optimizes the computation by aggressively caching data in memory.

Apache Tez

Generalizes the MapReduce paradigm to a more powerful and faster framework that executes computation as complex directed acyclic graphs of general data processing tasks.

Apache Giraph

An iterative graph processing framework for big data.

Apache Storm

A realtime stream processing engine.

Cloudera Impala

Fast SQL on Hadoop.



MapReduce

MapReduce is a programming model for implementing parallel, distributed algorithms. To define a computation in this paradigm, you provide the logic for two functions, map() and reduce(), that operate on <key, value> pairs.

The Map function takes a <key, value> pair and produces zero or more intermediate <key, value> pairs:

   Map (k1, v1) -> list (k2, v2)

The Reduce function takes a key and the list of values associated with that key and produces zero or more final <key, value> pairs:

   Reduce (k2, list (v2)) -> list (k3, v3)

Between the Map and Reduce phases, all intermediate <key, value> pairs produced by the Map functions are shuffled and sorted by key, so that all values associated with the same key are grouped together and passed to the same Reduce function.

Figure 5. Key grouping


The general purpose of a Map function is to transform or filter the input data. A Reduce function, on the other hand, typically aggregates or summarizes the data produced by the Map functions.

Figure 6 shows an example of using MapReduce to count the occurrences of distinct words in a sentence. The Map function splits the sentence into words and produces intermediate <key, value> pairs where the key is a word and the value equals 1. The Reduce function then sums all the 1s associated with a given word, returning the total number of occurrences of that word.

Figure 6. Using MapReduce to count occurrences of distinct words
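As an analogy (not how Hadoop itself executes the job), the same word count can be sketched as a plain Unix pipeline, which makes the three phases easy to see:

```shell
#!/bin/sh
# Map:     tr emits one word per line (each line plays the role of a <word, 1> pair)
# Shuffle: sort brings identical keys next to each other
# Reduce:  uniq -c counts the members of each group
echo 'to be or not to be' | tr ' ' '\n' | sort | uniq -c
```

For the sentence above this prints a count of 2 for ‘to’ and ‘be’ and 1 for ‘or’ and ‘not’.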


MapReduce on YARN

MapReduce on YARN is a framework that enables running MapReduce jobs on a Hadoop cluster powered by YARN. It provides a high-level API for implementing custom Map and Reduce functions in various languages, as well as the infrastructure needed to submit, run and monitor MapReduce jobs.

Note: MapReduce was historically the only programming model that you could use with Hadoop. It is no longer the case after the introduction of YARN. MapReduce is still the most popular application running on YARN clusters, though.

The execution of each MapReduce job is managed and coordinated by an instance of a special process called the MapReduce Application Master (MR AM). The MR AM spawns Map tasks that run map() functions and Reduce tasks that run reduce() functions. Each Map task processes a separate subset of the input dataset (one HDFS block by default). Each Reduce task processes a separate subset of the intermediate data produced by the Map tasks. What’s more, Map and Reduce tasks run in isolation from one another, which allows for parallel and fault-tolerant computation.

To optimize the computation, the MR AM tries to schedule data-local Map tasks. Such tasks execute in containers running on the NodeManagers that are collocated with the DataNodes that already store the data we want to process. Because each block in HDFS is by default redundantly stored on three DataNodes, there are three NodeManagers that can be asked to run a given Map task locally.


Submitting a MapReduce Job

Let’s see MapReduce in action and run a MapReduce job on a Hadoop cluster.

To get started quickly we use a jar file with MapReduce examples that is supplied with the Hadoop packages. On Linux systems it can be found under /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar.
We run the Word Count job explained in the previous section.

1.     Create a file named hamlet.txt with the following content:

   ‘To be or not to be’

2.     Upload the input data to HDFS:

   # hdfs dfs -mkdir input
   # hdfs dfs -put hamlet.txt input

3.     Submit the WordCount MapReduce job to the cluster:

   # hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input hamlet-output

After a successful submission, track the progress of this job on the ResourceManager web UI.

If you use a sandbox, the ResourceManager UI is available at http://localhost:8088

Figure 7: ResourceManager UI with running Job


4.     Check the output of this job in HDFS

   # hadoop fs -cat hamlet-output/*

Apart from the Word Count job, the jar file contains several other MapReduce examples. You can list them by typing the following command:

   # hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar

The table below provides a list and short descriptions of a couple of interesting MapReduce examples:

grep - counts the matches of a given regular expression in the input dataset.

pi - estimates Pi using a quasi-Monte Carlo method.

terasort - sorts the input dataset. Often used in conjunction with teragen and teravalidate.

wordmean - counts the average length of the words in the input dataset.

Processing Frameworks

Developing applications in native MapReduce can be time-consuming, daunting work reserved for programmers only.

Fortunately, there are a number of frameworks that make implementing distributed computations on a Hadoop cluster easier and quicker, even for non-developers. The most popular ones are Hive and Pig.


Hive

Hive provides a SQL-like language called HiveQL for easier analysis of data in a Hadoop cluster. When using Hive, our datasets in HDFS are represented as tables with rows and columns. Hive is therefore easy to learn and appealing to those who already know SQL and have experience working with relational databases.

That said, Hive can be considered a data warehouse infrastructure built on top of Hadoop.

A Hive query is translated into a series of MapReduce jobs (or a Tez directed acyclic graph) that are subsequently executed on a Hadoop cluster.

Hive example

Let’s process a dataset of songs listened to by users at given times. The input data is a tab-separated file, ‘songs.txt’:

“Creep” Radiohead piotr 2013-01-20
“Desert Rose” Sting adam 2013-01-14
“Desert Rose” Sting piotr 2013-02-10 
“Karma Police” Radiohead adam 2013-01-23
“Everybody” Madonna piotr 2013-01-01
“Stupid Car” Radiohead adam 2013-01-18
“All This Time” Sting adam 2013-01-13 

We use Hive to find the two most popular artists in January 2013:

Note: We assume that commands below are executed as user “training”.

1.     Put songs.txt file on HDFS:

# hdfs dfs -mkdir songs

# hdfs dfs -put songs.txt songs/

2.     Enter hive:

# hive


3.     Create an external table in Hive that gives a schema to our data on HDFS:

 hive> CREATE EXTERNAL TABLE songs(
          title STRING,
          artist STRING,
          user STRING,
          date DATE)
       ROW FORMAT DELIMITED
       FIELDS TERMINATED BY '\t'
       LOCATION '/user/training/songs';

4.     Check if the table was created successfully:

  hive> SHOW tables;

5.     You can also see the table’s properties and columns:

   hive> DESCRIBE FORMATTED songs;

Apart from information about column names and types, you can see other interesting properties:


# Detailed Table Information    

Database:             default             

Owner:                root                

CreateTime:           Tue Jul 29 14:08:49 PDT 2013

LastAccessTime:       UNKNOWN             

Protect Mode:         None                

Retention:            0                   

Location:            hdfs://localhost:8020/user/root/songs

Table Type:          EXTERNAL_TABLE

6.     Run a query that finds the two most popular artists in January 2013:

   hive> SELECT artist, COUNT(*) AS total
         FROM songs
         WHERE year(date) = 2013 AND month(date) = 1
         GROUP BY artist
         ORDER BY total DESC
         LIMIT 2;
This query translates into two MapReduce jobs. You can verify that by reading the standard output log messages generated by the Hive client, or by tracking the jobs executed on the Hadoop cluster using the ResourceManager web UI.
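Because the sample dataset is tiny, the result this kind of query computes can be sanity-checked locally with standard Unix tools (the plays in songs.txt fall in January 2013); the pipeline below only mirrors the query’s plan and has nothing to do with Hive itself:

```shell
#!/bin/sh
# Recreate the tab-separated songs.txt locally.
{
printf '%s\t%s\t%s\t%s\n' '"Creep"' Radiohead piotr 2013-01-20
printf '%s\t%s\t%s\t%s\n' '"Desert Rose"' Sting adam 2013-01-14
printf '%s\t%s\t%s\t%s\n' '"Desert Rose"' Sting piotr 2013-02-10
printf '%s\t%s\t%s\t%s\n' '"Karma Police"' Radiohead adam 2013-01-23
printf '%s\t%s\t%s\t%s\n' '"Everybody"' Madonna piotr 2013-01-01
printf '%s\t%s\t%s\t%s\n' '"Stupid Car"' Radiohead adam 2013-01-18
printf '%s\t%s\t%s\t%s\n' '"All This Time"' Sting adam 2013-01-13
} > songs.txt

# Filter the January 2013 rows, project the artist column, group and
# count, then keep the two largest groups - the same plan the query expresses.
awk -F'\t' '$4 ~ /^2013-01/ {print $2}' songs.txt \
  | sort | uniq -c | sort -rn | head -2
```

This prints Radiohead with 3 plays and Sting with 2, matching the counts shown in the Tez output later in the post.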

Note: at the time of this writing, MapReduce was the default execution engine for Hive. This may change in the future. See the next section for instructions on how to set a different execution engine for Hive.



Tez

Hive is not constrained to translating queries into MapReduce jobs only. You can also instruct Hive to express its queries using other distributed frameworks, such as Apache Tez.

Tez is an efficient framework that executes a computation as a DAG (directed acyclic graph) of tasks. With Tez, a complex Hive query can be expressed as a single Tez DAG rather than multiple MapReduce jobs. This avoids the overhead of launching multiple jobs and the cost of storing intermediate data between jobs on HDFS, which saves I/O.

To benefit from Tez’s fast response times, simply override the hive.execution.engine property and set it to tez.

Follow these steps to execute the Hive query from the previous section as a Tez application:

1.     Enter hive:

# hive

2.     Set execution engine to tez:

hive> SET hive.execution.engine=tez;

3.     Execute the query from the Hive section:

Note: you will now see different logs on the console than when the query was executed on MapReduce:

Total Jobs = 1
Launching Job 1 out of 1
Status: Running application id: application_123123_0001
Map 1: -/-    Reducer 2: 0/1    Reducer 3: 0/1
Map 1: 0/1    Reducer 2: 0/1    Reducer 3: 0/1
Map 1: 1/1    Reducer 2: 1/1    Reducer 3: 1/1
Status: Finished successfully
Radiohead 3
Sting 2

The query now executes as a single Tez job instead of two MapReduce jobs as before. Tez isn’t tied to the strict MapReduce model; it can execute any sequence of tasks in a single job, for example Reduce tasks after Reduce tasks, which brings significant performance benefits.

Find out more about Tez on the blog: http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing.



Pig

Apache Pig is another popular framework for large-scale computations on Hadoop. Similarly to Hive, Pig allows you to implement computations in an easier, faster and less verbose way than plain MapReduce. Pig introduces a simple yet powerful scripting-like language called PigLatin. PigLatin supports many common, ready-to-use data operations such as filtering, aggregating, sorting and joining, and developers can also implement their own functions (UDFs) to extend Pig’s core functionality.

Like Hive queries, Pig scripts are translated into MapReduce jobs scheduled to run on a Hadoop cluster.

We use Pig to find the most popular artists, as we did with Hive in the previous example.

1.     Save the following script in a file named top-artists.pig:

a = LOAD 'songs/songs.txt' as (title, artist, user, date);
b = FILTER a BY date MATCHES '2013-01-.*';
c = GROUP b BY artist;
d = FOREACH c GENERATE group, COUNT(b) AS total;
e = ORDER d by total DESC;
f = LIMIT e 2;
STORE f INTO 'top-artists-pig';

2.     Execute the Pig script on the Hadoop cluster:

# pig top-artists.pig

3.     Read the contents of the output directory:

   # hadoop fs -cat top-artists-pig/*

When developing Pig scripts you can iterate in local mode and catch mistakes before submitting jobs to the cluster. To enable local mode, add the -x local option to the pig command.



Apache Hadoop is one of the most popular tools for big data processing thanks to great features such as a high-level API, scalability, the ability to run on commodity hardware, fault tolerance and its open-source nature. Hadoop has been successfully deployed in production by many companies for several years.

The Hadoop Ecosystem offers a variety of open-source tools for collecting, storing and processing data as well as cluster deployment, monitoring and data security. Thanks to this amazing ecosystem of tools, each company can now easily and relatively cheaply store and process a large amount of data in a distributed and highly scalable way.

Hadoop Ecosystem

This table contains names and short descriptions of the most useful and popular projects from the Hadoop Ecosystem that have not been mentioned yet:


Oozie - workflow scheduler system to manage Hadoop jobs.

ZooKeeper - framework that enables highly reliable distributed coordination.

Sqoop - tool for efficient transfer of bulk data between Hadoop and structured datastores such as relational databases.

Flume - service for aggregating, collecting and moving large amounts of log data.

HBase - non-relational, distributed database running on top of HDFS. It enables random, realtime read/write access to your Big Data.


Additional Resources

·         http://hadoop.apache.org/
·         https://hive.apache.org/
·         http://pig.apache.org/
·         http://giraph.apache.org/
·         https://mahout.apache.org/
·         http://tez.apache.org/
·         https://spark.apache.org/
·         https://storm.incubator.apache.org/
Major packaged distributions:
BigData research guide:

BigDataResearchGuide_4.pdf (5.4 MB)


Cleaning up WinSxS in Windows 7 and 8


15 February 2013

Windows 7 and Windows 8 introduced a new folder, "WinSxS", located in C:\Windows, which mainly stores component files. It also doubles as a dumping ground for old versions of all DLL libraries and component files, and its size grows constantly. On top of that, a lot of space is taken up by backup folders, which grow really large after installing Service Pack 1 for Windows 7. For example, on my PC with Windows 7 Home Pro installed, this directory takes up almost 8 GB.


That is quite a lot of space, especially for a fresh installation of either operating system. As soon as you install OS updates or a service pack, this directory grows by another few gigabytes. As Microsoft puts it, this is a superset of files for Windows (they are required for the stable operation of the OS), so you should not delete the directory entirely (although it is possible). You can, however, save some disk space.

The first thing you can do is shrink this directory's backup store with a simple command: press Win+R, type cmd, and in the command prompt that appears enter:

dism /online /cleanup-image /spsuperseded /hidesp


See the Microsoft website for a detailed description of the command's options.

Keep in mind that on x64 systems Dism.exe is located in a different directory (SysNative or SysWOW64, depending on the OS type), so you need to specify the path to it explicitly, for example:

cd C:\Windows\SysNative

...and only then run the command above.

Another option, which requires no system administration skills, is to use the standard "Disk Cleanup" tool.

Right-click on any logical drive, select "Properties" in the context menu, then click "Disk Cleanup". The backup files are exactly what is stored in the WinSxS directory.

Pretty simple, right?


Windows is well known for its frequent updates, and users usually install everything that appears in Windows Update without much thought, or perhaps don't install updates at all. To avoid errors when running the system commands above, it is recommended to install update KB2533552 (it fixes errors that occur after installing SP1). Check whether this update is present; if not, download and install it (http://support.microsoft.com/kb/2533552).