Tuesday, 10 December 2013

Awk for Combining Two CSV Files and Averaging by Row Value

The Problem


I have a lot of information about books contained in a number of different CSV files. Each file has a different piece of information, such as the number of pages or the genre etc. Each file has the ISBN number in the first column as an identifier but not all ISBNs appear in all files. What I want to do is find out, for example, the average number of pages in a book in a given genre. To do this I need to combine two files and average across the pages column based on the genre column.

The Solution


The solution I have used is based heavily on two responses in [1] and [2]. The script is as follows:

awk -F',' 'FNR==NR{a[$1]=$2;next}($1 in a){sum[a[$1]]+=$2;N[a[$1]]++}END{for (key in sum){avg=sum[key]/N[key];printf "%s, %f, %i\n", key, avg, N[key];} }' genreFile.csv pagesFile.csv

where genreFile and pagesFile are the two input files. Both have the same format of:

ISBN,<value>

In the genreFile, the value is a number identifying a genre and in the pagesFile it is the number of pages in the book. The script runs as follows:

awk -F','

This sets the delimiter to be a comma

FNR==NR{a[$1]=$2;next}

FNR is the number of input records (or line) processed from the current file and NR is the total number processed. At first we have a portion of code that runs only when FNR==NR which means it only runs while processing the first file. What it does is create an array a which maps the first column (an ISBN number) to a genre. When the first file is finished the script moves on to the next part:

($1 in a){sum[a[$1]]+=$2;N[a[$1]]++}

This part runs during the second file. The round brackets serve as an if so that if the ISBN number on the current line in the second file existed in the first file then we go further. What we do is create two new arrays. One holds the sum of the number of pages for every book in a specific genre and the second keeps count of the number of books in that genre.

When both files have been processed, the results are calculated and printed with:

END{for (key in sum){avg=sum[key]/N[key];printf "%s, %f, %i\n", key, avg, N[key];} }

This simply looks over the arrays calculating the average and printing it out. 

Performance


Both input files have more than 4 million records in them and over 3 million shared ISBN numbers. using Linux's time function, the awk script takes just under 15 seconds to complete.

References


[1] http://stackoverflow.com/questions/14984340/using-awk-to-process-input-from-multiple-files
[2] http://stackoverflow.com/questions/4046430/average-of-column-by-hours-rows-using-awk

1 comment:

  1. While some supplies, like polymers, primarily require the development of Direct CNC a new new} supplies industry for 3D printing, additive producers can use previously present wire, steel powders or feedstock. While steel is troublesome to use in standard manufacturing methods, with 3D printing, producers can more simply make elements from the identical materials that they would normally machine. Developed by Saudi Basic Industries Corporation , the world’s first concept automobile was made with 3D printing expertise, of which the physique makes use of progressive supplies and processing expertise.

    ReplyDelete