PyF
Flow-based Python programming
May 25th, 2011 - 20H00 @ CRI by Renaud Lifchitz
Introduction to PyF
PyF is an open-source Python framework and platform dedicated to large-scale data processing, mining, transformation, reporting and more.
What is an ETL? http://en.wikipedia.org/wiki/Extract,_transform,_load
- Project page: http://www.pyfproject.org/
- Installation: http://www.pyfproject.org/en/getting-started
- Configuration: http://www.pyfproject.org/en/getting-started/configuring
- Architecture: http://www.pyfproject.org/en/welcome/components
- Plugins: http://www.pyfproject.org/documentation/contents/plugins/
First example: simple RSS to CSV converter
PyF tube:
Producer code:
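import time

import feedparser

def get_source():
    # Download and parse the RSS feed.
    d = feedparser.parse("http://rss.lemonde.fr/c/205/f/3050/index.rss")
    size = len(d['entries'])
    for i, entry in enumerate(d['entries']):
        # progression_callback and message_callback are not defined in this
        # snippet; presumably PyF provides them when the producer runs.
        progression_callback(float(i+1)/size*100)  # progress in percent
        message_callback("[NEWS] %s" % entry.title)
        yield entry
        time.sleep(0.5)  # pause between items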
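To try the producer pattern outside PyF, the sketch below replaces the feed with an in-memory list and passes the two callbacks as plain arguments. The name `make_producer`, the sample entries and the way the callbacks are wired are assumptions made for illustration, not PyF's API.

```python
def make_producer(entries, progression_callback, message_callback):
    """Hypothetical stand-in for a producer: yields entries one by one,
    reporting progress (in percent) and a message for each."""
    size = len(entries)
    for i, entry in enumerate(entries):
        progression_callback(float(i + 1) / size * 100)
        message_callback("[NEWS] %s" % entry["title"])
        yield entry


# Illustrative data; a real feed entry has many more fields.
entries = [{"title": "A"}, {"title": "B"}, {"title": "C"}, {"title": "D"}]

progress_log = []
messages = []

# The callbacks only fire as the generator is consumed.
items = list(make_producer(entries, progress_log.append, messages.append))

print(progress_log)
print(messages)
```

Because the producer is a generator, nothing is fetched or reported until a downstream consumer asks for the next item.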
Filter expression:
item.updated_parsed.tm_hour in range(12,14)
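The expression keeps items whose update hour is 12 or 13, since `range(12,14)` excludes 14. A quick check in plain Python, using hypothetical items built by a `make_item` helper (not part of PyF or feedparser):

```python
import time
from types import SimpleNamespace


def keep(item):
    # Same expression as the filter above.
    return item.updated_parsed.tm_hour in range(12, 14)


def make_item(hour):
    # Hypothetical helper: builds an object with an updated_parsed
    # attribute holding a time.struct_time, like a parsed feed entry.
    parsed = time.struct_time((2011, 5, 25, hour, 0, 0, 2, 145, 0))
    return SimpleNamespace(updated_parsed=parsed)


hours = [11, 12, 13, 14]
kept = [h for h in hours if keep(make_item(h))]
print(kept)  # [12, 13]
```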
Second example: Multi-page web scraper
PyF tube:
Individual item XPath:
/html[1]/body[1]/section[1]/div[1]/article/header[1]/h1[1]/a[2]
Other pages URL XPath:
//a[starts-with(text(),"Suivant")]//@href
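To see what the two XPath expressions select, here is a rough approximation using only the standard library. The page markup is invented for the example. `xml.etree.ElementTree` supports only a subset of XPath (no `starts-with()`), so the "Suivant" test is done in Python instead. This is a sketch of the selection logic, not how PyF evaluates the expressions.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page: each article title holds two links,
# and the second one is the news item itself.
PAGE = """<html><body>
<section><div>
<article><header><h1><a href="/sections/a">Section</a><a href="/news/first">First title</a></h1></header></article>
<article><header><h1><a href="/sections/b">Section</a><a href="/news/second">Second title</a></h1></header></article>
</div></section>
<a href="/news?page=2">Suivant</a>
</body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Item XPath, written relative to <html>: second <a> of each article title.
item_path = "body[1]/section[1]/div[1]/article/header[1]/h1[1]/a[2]"
titles = [a.text for a in root.findall(item_path)]

# Next-page XPath, approximated: href of every <a> whose text starts
# with "Suivant" ("Next" in French).
next_hrefs = [
    a.get("href")
    for a in root.iter("a")
    if (a.text or "").startswith("Suivant")
]

print(titles)
print(next_hrefs)
```

Note that `article` carries no index in the item XPath, so it matches every article on the page, while the indexed steps pin down one element each.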
href computed attribute:
"http://linuxfr.org%s" % base_item.href[0]
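The `[0]` index suggests that `href` holds a list of matches, of which the first is kept, and that the link is site-relative so the host must be prepended. A small sketch with a made-up `base_item` (the real object comes from PyF), written for Python 3:

```python
from types import SimpleNamespace
from urllib.parse import urljoin

# Hypothetical item: href is a list, as an XPath attribute query would return.
base_item = SimpleNamespace(href=["/news?page=2"])

# Same expression as the computed attribute above.
url = "http://linuxfr.org%s" % base_item.href[0]

# Alternative: urljoin also copes with hrefs that are already absolute.
joined = urljoin("http://linuxfr.org", base_item.href[0])

print(url)
print(joined)
```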