Friday, December 13, 2013

parsing technorati - part I

Task: Download all Technorati politics blog listings, import them into FileMaker Pro, and parse each blog.

First, figure out how many pages actually hold blogs, since not every page in the listing does. In the politics set, 697 pages are listed, but the last blog is on page 691.
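One way to double-check that number, sketched here under assumptions: once the listing pages are saved locally, search them for a string that only appears on pages holding a blog entry, and take the highest page number that matches. The function name last_page_with and the marker string are hypothetical; the real marker depends on Technorati's page markup, which isn't covered here.

```shell
# last_page_with MARKER DIR
# Prints the highest N for which DIR/page-N (or DIR/page-N.txt) contains MARKER.
# MARKER is a hypothetical placeholder: pick a string that shows up only
# on pages that actually list a blog.
last_page_with() {
  marker=$1
  dir=$2
  # -l prints only the names of matching files, -F treats MARKER as plain text.
  grep -lF -- "$marker" "$dir"/page-* 2>/dev/null |
    sed 's/.*page-\([0-9][0-9]*\).*/\1/' |   # keep just the page number
    sort -n |
    tail -n 1
}
```

Run from the download folder as last_page_with 'your-marker' . and it prints nothing if no page matches.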

In Terminal, create a folder, change into it, and download the pages using wget. Bash expands the {1..691} range into 691 separate URLs before wget runs:

wget http://www.technorati.com/blogs/directory/politics/page-{1..691}
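The same download can also be written as a small script, which makes it easier to change the page range or pause between requests. This is only a sketch: the function names page_urls and fetch_pages and the folder name in the usage line are made up for illustration, and the one-second wait is an assumption about what counts as polite, not something the site asks for.

```shell
# Base URL taken from the wget command above.
base="http://www.technorati.com/blogs/directory/politics"

# page_urls FIRST LAST
# Prints one listing URL per line for pages FIRST through LAST.
page_urls() {
  n=$1
  while [ "$n" -le "$2" ]; do
    printf '%s/page-%d\n' "$base" "$n"
    n=$((n + 1))
  done
}

# fetch_pages OUT_DIR FIRST LAST
# Creates OUT_DIR if needed and downloads the pages into it.
fetch_pages() {
  out_dir=$1
  mkdir -p "$out_dir" || return 1
  # --wait=1 pauses one second between requests, -P sets the folder the
  # files are saved into, and -i - reads the URL list from standard input.
  page_urls "$2" "$3" | wget --wait=1 -P "$out_dir" -i -
}

# Usage (hypothetical folder name): fetch_pages technorati-politics 1 691
```

Nothing is downloaded until fetch_pages is called, so page_urls 1 3 can be run on its own to check the URLs first.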

FileMaker won't import a folder unless the files in it have certain extensions. A little bash work takes care of that:

for i in *; do mv "$i" "$i.txt"; done
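The one-liner renames everything in the folder, so running it twice produces names ending in .txt.txt. A slightly more careful variant, as a sketch, skips anything that already ends in .txt and anything that isn't a regular file. The function name add_txt_suffix is made up for this example.

```shell
# add_txt_suffix DIR
# Appends .txt to every regular file in DIR that doesn't already end in .txt.
add_txt_suffix() {
  dir=$1
  for f in "$dir"/*; do
    [ -f "$f" ] || continue      # skip subfolders and an unmatched glob
    case "$f" in
      *.txt) continue ;;         # already renamed, leave it alone
    esac
    mv -- "$f" "$f.txt"
  done
}
```

Called as add_txt_suffix . from inside the download folder, it does the same job as the loop above and is safe to run again.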

Now off to FileMaker. I create a new database and import the folder, choosing text files as the file type and telling it to create a new table. The text content of each file is loaded into the new database, along with the name of the file and the location it was imported from. Once that's done, all that's left is to parse out the individual blogs.

But that's a tale for part II.
