Code Snippets
Here you'll find snippets of Python code for various data processing tasks. Below each snippet is an IPython %loadpy magic command that can be used like this:
In [1]: %loadpy http://econpy.pythonanywhere.com/scripts/foo.py
The %loadpy magic function accepts a URL or a path to a local .py script and loads the contents of the script into your IPython terminal without executing it. This is handy when you want to make a quick edit to a remote or local Python script before running it.
Generally Useful Functions
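Create a list containing the paths of all files in a directory (dir_name) and, if sub_dir is True, its subdirectories. Any additional arguments are treated as file extensions (without the leading dot), and only files with those extensions are included.
import os

def dir_list(dir_name, sub_dir, *args):
    file_list = []
    for file in os.listdir(dir_name):
        dirfile = os.path.join(dir_name, file)
        if os.path.isfile(dirfile):
            if len(args) == 0:
                # No extensions given: keep every file
                file_list.append(dirfile)
            else:
                # Keep the file only if its extension (without the dot) was requested
                if os.path.splitext(dirfile)[1][1:] in args:
                    file_list.append(dirfile)
        elif os.path.isdir(dirfile) and sub_dir:
            # Recurse into the subdirectory
            file_list += dir_list(dirfile, sub_dir, *args)
    return file_list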
%loadpy http://econpy.pythonanywhere.com/scripts/dir_list.py
Merge all files in a list of files (file_list) into a single file (output_file).
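def mergefiles(file_list, output_file):
    # Write the contents of each file, in list order, to output_file
    with open(output_file, 'w') as f:
        for file in file_list:
            print('Writing file: %s' % file)
            # Close each input file as soon as it has been copied
            with open(file) as src:
                f.write(src.read())
    print('File created: %s' % output_file)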
%loadpy http://econpy.pythonanywhere.com/scripts/mergefiles.py
It is often useful to combine the previous two functions: take the path to a directory as input and produce a single file by merging every file in that directory (and every file in its subdirectories if sub_dir is True). To do so, pass the list returned by dir_list as the file_list argument of mergefiles. For example, the following command merges every .txt file in /home/user/Desktop/Data (not including files in subdirectories) and saves the merged content to a file called outputFile.txt in the working directory:
In [2]: mergefiles(dir_list('/home/user/Desktop/Data', False, 'txt'), 'outputFile.txt')
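The same directory-to-single-file idea can also be written as one function. The following is a minimal sketch, not one of the original scripts: merge_dir and its exts parameter are hypothetical names, and it swaps the recursive listing for the standard library's os.walk.

```python
import os

def merge_dir(dir_name, output_file, sub_dir=False, exts=()):
    """Merge matching files under dir_name into output_file.

    merge_dir is a hypothetical helper, not one of the original scripts.
    exts is a tuple of extensions without the dot, e.g. ('txt',);
    an empty tuple means "every file".
    """
    out_path = os.path.abspath(output_file)
    merged = []
    with open(out_path, 'w') as out:
        for root, dirs, files in os.walk(dir_name):
            if sub_dir:
                dirs.sort()   # visit subdirectories in a stable order
            else:
                dirs[:] = []  # emptying dirs stops os.walk from descending
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_path:
                    continue  # never merge the output file into itself
                if exts and os.path.splitext(name)[1][1:] not in exts:
                    continue  # extension not requested
                with open(path) as src:
                    out.write(src.read())
                merged.append(path)
    return merged
```

Under these assumptions, merge_dir('/home/user/Desktop/Data', 'outputFile.txt', exts=('txt',)) would do the same job as the command above and return the list of files it merged.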
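Remove all HTML tags from a string.
def striptags(raw_html):
    # One-element list so the nested function can update the flag
    tag = [False]

    def checkit(i):
        if tag[0]:
            # Inside a tag: stay inside until the closing '>'
            tag[0] = (i != '>')
            return False
        elif i == '<':
            # Start of a tag
            tag[0] = True
            return False
        return True

    # Keep only the characters that fall outside of tags
    return ''.join(i for i in raw_html if checkit(i))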
%loadpy http://econpy.pythonanywhere.com/scripts/striptags.py
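A character-by-character scan treats the first '>' it meets as the end of a tag, so a '>' inside a quoted attribute value leaks attribute text into the result. As an alternative, here is a minimal sketch built on the standard library's html.parser module. The names _TextCollector and striptags_parsed are hypothetical, and unlike the scan above this version also decodes character references such as &amp;.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text outside of tags
        self.parts.append(data)

def striptags_parsed(raw_html):
    collector = _TextCollector()
    collector.feed(raw_html)
    collector.close()  # flush any text still buffered by the parser
    return ''.join(collector.parts)

print(striptags_parsed('<p>Hello <b>world</b></p>'))
```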
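Remove duplicate items from a list while preserving their original order. The optional idfun argument maps each item to the key used for comparison.
def uniquify(myList, idfun=None):
    if idfun is None:
        # By default, compare the items themselves
        def idfun(x):
            return x
    seen, result = {}, []
    for item in myList:
        marker = idfun(item)
        if marker in seen:
            # Already emitted an item with this key
            continue
        seen[marker] = 1
        result.append(item)
    return result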
%loadpy http://econpy.pythonanywhere.com/scripts/uniquify.py
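To illustrate what a key function changes, here is a small self-contained sketch. uniquify_by is a hypothetical stand-in that follows the same seen-marker approach with a set, and the sample list is made up for the example. The last line shows dict.fromkeys, which gives the same order-preserving result on Python 3.7 and later when no key function is needed.

```python
def uniquify_by(items, key=None):
    """Hypothetical stand-in for an order-preserving de-duplication."""
    seen = set()
    result = []
    for item in items:
        marker = item if key is None else key(item)
        if marker not in seen:
            seen.add(marker)
            result.append(item)
    return result

# Made-up sample data
lines = ['Apple', 'apple', 'Pear', 'Apple', 'pear']

print(uniquify_by(lines))                 # ['Apple', 'apple', 'Pear', 'pear']
print(uniquify_by(lines, key=str.lower))  # ['Apple', 'Pear']
print(list(dict.fromkeys(lines)))         # ['Apple', 'apple', 'Pear', 'pear']
```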
Return only the digits within a string using a lambda function.
def onlyDigits(myStr):
    # Join the filtered characters so the result is a string on Python 3 as well
    return ''.join(filter(lambda x: x.isdigit(), myStr))
%loadpy http://econpy.pythonanywhere.com/scripts/onlydigits.py
Other Tricks and Tools
Change your user-agent using the requests module.
import requests

url = "http://econpy.pythonanywhere.com/ex/cpu.html"
# Default User-Agent sent by requests (r.config, used by early requests releases, is no longer available)
oldUA = requests.utils.default_user_agent()
newHeader = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'}
# Send the request with the replacement User-Agent
r = requests.get(url, headers=newHeader)
print("OLD: %s" % oldUA)
print("NEW: %s" % r.request.headers['User-Agent'])
%loadpy http://econpy.pythonanywhere.com/scripts/useragent.py
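When several requests should share the same user-agent, one option is to set it once on a requests.Session. The sketch below is only an illustration: the CUSTOM_UA string is a made-up value, and the request is prepared but never sent, so the header can be inspected without any network access.

```python
import requests

# Hypothetical user-agent string, chosen only for this example
CUSTOM_UA = 'my-data-script/0.1'

session = requests.Session()
default_ua = session.headers['User-Agent']          # what requests sends by default
session.headers.update({'User-Agent': CUSTOM_UA})   # applies to every request on this session

# Build the request without sending it, then look at the merged headers
req = requests.Request('GET', 'http://econpy.pythonanywhere.com/ex/cpu.html')
prepared = session.prepare_request(req)

print("OLD: %s" % default_ua)
print("NEW: %s" % prepared.headers['User-Agent'])
```

Calling session.get(url) afterwards would send the same header on every request made through that session.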
Scrape data from multiple websites using regular expressions.
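import re
import requests

URLs = ['http://econpy.pythonanywhere.com/ex/001.html',
        'http://econpy.pythonanywhere.com/ex/cpu.html']
reStr = '<title>(.*?)</title>'
# Compile the pattern once and reuse it for every page
pattern = re.compile(reStr)
for url in URLs:
    # .text is the decoded page body, so the str pattern can be matched against it
    page = requests.get(url).text
    print(pattern.findall(page))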
%loadpy http://econpy.pythonanywhere.com/scripts/regexmultiplepages.py
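A plain '<title>(.*?)</title>' pattern misses titles written in upper case or spread over several lines. The following sketch shows one way to loosen the pattern with regex flags. extract_titles and the sample string are hypothetical, and the example runs on a local string rather than a downloaded page.

```python
import re

# IGNORECASE matches <TITLE> as well; DOTALL lets '.' span line breaks
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_titles(html):
    """Hypothetical helper: return every title found, with whitespace trimmed."""
    return [t.strip() for t in TITLE_RE.findall(html)]

# Made-up sample page
sample = '<html><head><TITLE>\n  CPU Prices\n</TITLE></head></html>'
print(extract_titles(sample))  # ['CPU Prices']
```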