Posts filed under 'Python'
Peeking at Large Files in Python
I have been parsing files in the multi-gigabyte range. Python handles them pretty well, but it can still take a while to chug through them, and I honestly don’t know of any great tricks to speed this up. One thing that does help is reading just the first few lines, so you can see the format of a large file without going through all of it. The following code copies the first 100 lines of a text file into a new “peek” file. To copy the entire file, take out the if/else check and write every line.
inFileName = "associations.txt"
inFile = open(inFileName, 'r')
outFile = open("peek_%s" % inFileName, 'w')
count = 0
for line in inFile:
    count += 1
    if count <= 100:
        outFile.write(line)
    else:
        break
outFile.close()
inFile.close()
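The same peek can probably be written a bit more compactly with itertools.islice, which stops after a fixed number of lines without a manual counter. This is only a sketch: the function name peek and its parameters are my own choices for illustration, not part of the code above.

```python
import itertools

def peek(in_file_name, out_file_name, num_lines=100):
    """Copy at most num_lines lines from in_file_name to out_file_name."""
    # The with statement closes both files even if an error occurs.
    with open(in_file_name, 'r') as in_file, open(out_file_name, 'w') as out_file:
        # islice reads lazily, so lines beyond num_lines are never requested.
        out_file.writelines(itertools.islice(in_file, num_lines))
```

Calling peek("associations.txt", "peek_associations.txt") should give the same result as the loop above.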
July 14, 2008
Find Files in Directory Using Python
A nice solution to this is the path Python module. However, the following simple solution will do the trick. It only does plain substring matching on file names and doesn’t support wildcards at this point, but that could easily be added with some regular expression code.
import os

def getFilesMatchingPattern(directory, nonWildCardPattern):
    # Return the names in the directory that contain the given substring.
    fileList = os.listdir(directory)
    return [f for f in fileList if f.find(nonWildCardPattern) > -1]
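For the wildcard case, one option that avoids hand-written regular expressions is the standard fnmatch module, which understands shell-style patterns such as *.txt. The sketch below is one possible approach; the function name getFilesMatchingWildcard is made up for this example.

```python
import fnmatch
import os

def getFilesMatchingWildcard(directory, wildcardPattern):
    """Return the names in directory that match a shell-style pattern such as '*.txt'."""
    # fnmatch.filter applies the pattern to each name and keeps the matches.
    return sorted(fnmatch.filter(os.listdir(directory), wildcardPattern))
```

Note that fnmatch follows the operating system's rules for case, so the same pattern may match slightly differently on Windows than on Linux.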
July 10, 2008
Converting a String to a Boolean in Python
Let’s say you have a string value that you want to convert to a boolean, but you’re not sure what format it will be in. Some languages have built-in functions for doing this, but to my knowledge Python doesn’t. Here’s a way to do it, though it’s not comprehensive: it only checks whether the string starts with a T. (Thanks to the commenter who helped me see a simpler way to do this.)
def parseBoolString(theString):
    # Slicing with [:1] avoids an IndexError on an empty string.
    return theString[:1].upper() == 'T'

parseBoolString("true")
True
parseBoolString("false")
False
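A somewhat more comprehensive version might compare against explicit sets of accepted spellings and reject anything else, rather than treating every unrecognized string as False. This is a sketch; the name parseBoolStringStrict and the particular spellings in the two sets are my own assumptions about what you might want to accept.

```python
# Accepted spellings are an assumption; adjust them to match your data.
TRUE_STRINGS = {'true', 't', 'yes', 'y', '1', 'on'}
FALSE_STRINGS = {'false', 'f', 'no', 'n', '0', 'off'}

def parseBoolStringStrict(theString):
    """Convert a string to a bool, raising ValueError for unrecognized input."""
    normalized = theString.strip().lower()
    if normalized in TRUE_STRINGS:
        return True
    if normalized in FALSE_STRINGS:
        return False
    raise ValueError("Cannot interpret %r as a boolean" % theString)
```

Raising an error for unknown input means a typo such as "ture" gets noticed instead of silently becoming False.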
April 8, 2008
Simple Method to Search a Python List
Let’s say you have a list of objects of type Individual and that list is called individuals.
The Individual type contains an ID, name, and email address.
Let’s say you have an ID and want to get the corresponding Individual object from the list. How would you go about doing that?
# The result is a list of every match; match[0] is the object itself when the ID exists.
match = [ind for ind in individuals if ind.id == theID]
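If you only want the first matching object rather than a list, next() with a generator expression is one way to get it. The sketch below uses a made-up Individual class and sample data, since the real type isn’t shown here, and it also shows a dictionary keyed by ID for cases where you do many lookups.

```python
class Individual:
    """Hypothetical stand-in for the Individual type described above."""
    def __init__(self, id, name, email):
        self.id = id
        self.name = name
        self.email = email

# Sample data invented for this example.
individuals = [
    Individual(1, 'Ann', 'ann@example.com'),
    Individual(2, 'Bob', 'bob@example.com'),
]

def findIndividual(individuals, theID):
    """Return the first Individual whose id equals theID, or None."""
    # next() stops at the first match instead of scanning the whole list.
    return next((ind for ind in individuals if ind.id == theID), None)

# For many lookups, build a dictionary keyed by ID once and reuse it.
individualsByID = {ind.id: ind for ind in individuals}
```

The dictionary costs one pass over the list up front, after which each lookup no longer has to walk the list.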
April 4, 2008
Sorting Dictionaries in Python
Newer versions of Python have a built-in function called sorted that can help you sort dictionaries. Below is the basic functionality.
Sort by key:
sorted(x.items())
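Sort by value:
sorted(x.items(), key=lambda item: item[1])
If you want an object that iterates over a dictionary in key order, a small class will do it:
class SortedDictionary:
    # Iterates over a dictionary's (key, value) pairs in key order.
    def __init__(self, dictToSort):
        self.keys = sorted(dictToSort.keys())
        self.values = [dictToSort[key] for key in self.keys]
        self._lastIndex = -1

    def __iter__(self):
        return self

    def __next__(self):
        if self._lastIndex < (len(self.keys) - 1):
            self._lastIndex += 1
            return (self.keys[self._lastIndex], self.values[self._lastIndex])
        else:
            raise StopIteration

    next = __next__  # Python 2 spelling of the iterator method

x = {}
x['abc'] = 1
x['aaa'] = 2
y = SortedDictionary(x)
print(y.keys)
print(y.values)
for z in y:
    print(z)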
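Here are a few variations on sorting by value with sorted, using its key and reverse arguments. The dictionary contents are made up for the example.

```python
x = {'abc': 1, 'aaa': 2, 'abd': 3}

# Sort the (key, value) pairs by value, smallest first.
byValue = sorted(x.items(), key=lambda item: item[1])

# Same, largest value first.
byValueDescending = sorted(x.items(), key=lambda item: item[1], reverse=True)

# Just the keys, ordered by their values.
keysByValue = sorted(x, key=x.get)
```

In each case sorted returns a new list and leaves the dictionary itself untouched.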
March 28, 2008
Append a List to a List in Python
(Note: Please see my latest posts at my new blog!)
An easy way to do this is with the list’s extend method:
x = [1,2,3]
x.extend([4,5])
# x is now [1, 2, 3, 4, 5]
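It may help to see extend next to the two operations it is most often confused with, append and the + operator. The variable names below are just for illustration.

```python
x = [1, 2, 3]
x.extend([4, 5])      # adds each element: x is [1, 2, 3, 4, 5]

y = [1, 2, 3]
y.append([4, 5])      # adds the list itself as one element: y is [1, 2, 3, [4, 5]]

a = [1, 2, 3]
z = a + [4, 5]        # builds a new list and leaves a unchanged
```

So extend and + both flatten the second list into the first, but only extend modifies the original list in place.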
March 19, 2008
Simple Method to Calculate Median in Python
def getMedian(numericValues):
    theValues = sorted(numericValues)
    if len(theValues) % 2 == 1:
        # Odd count: the middle element (// keeps the index an integer).
        return theValues[(len(theValues)+1)//2-1]
    else:
        # Even count: the mean of the two middle elements.
        lower = theValues[len(theValues)//2-1]
        upper = theValues[len(theValues)//2]
        return (float(lower + upper)) / 2

def validate(valueShouldBe, valueIs):
    print("Value Should Be: %.6f, Value Is: %.6f, Correct: %s" % (valueShouldBe, valueIs, valueShouldBe==valueIs))

validate(2.5, getMedian([0,1,2,3,4,5]))
validate(2, getMedian([0,1,2,3,4]))
validate(2, getMedian([3,1,2]))
validate(3, getMedian([3,2,3]))
validate(1.234, getMedian([1.234, 3.678, -2.467]))
validate(1.345, getMedian([1.234, 3.678, 1.456, -2.467]))
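If you are on Python 3.4 or later, the standard library’s statistics module has a median function that can serve as a cross-check for a hand-written one. The sketch below runs it on the integer test cases above; the names samples and medians are mine.

```python
import statistics

# The integer test cases used above.
samples = [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4], [3, 1, 2], [3, 2, 3]]

# statistics.median sorts its input and averages the two middle values when the count is even.
medians = [statistics.median(values) for values in samples]
print(medians)
```

Unlike getMedian, statistics.median raises StatisticsError on an empty list, which is worth knowing if your input might be empty.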
March 17, 2008
Filtering Data in Python (Example of Functional Programming Approach)
Let’s say you are doing a cancer study and have a list of patients of various ages in a tab-delimited file. You want to limit the study to patients who are 60 years or older. One way to do this is to use a for loop, process the data one row at a time, and remove any patients below your threshold. Another way is to insert the data into a SQL database, use a WHERE clause to filter it, and then extract it back out. (That’s pretty desperate, but I know it happens!)
One simple way to do this in Python is to use the filter function. Let’s say you pull the data from the file into a list of rows, where each row is a list of column values.
file = open("Patients.csv", 'r')
patients = [line.rstrip().split('\t') for line in file]
file.close()
Now suppose age is the 3rd column in the data. You need to create a small function to determine whether a row meets the criteria:
def f(x): return int(x[2]) >= 60
Then you use the filter function and apply that function to the data.
# list() is needed on Python 3, where filter returns a lazy iterator.
matches = list(filter(f, patients))
This is a contrived example, but I hope it gives you a flavor of how functional programming can be a powerful approach.
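To try the idea without a data file, here is a self-contained sketch that runs filter on a few in-memory rows and compares it with the equivalent list comprehension. The patient rows and the name isOldEnough are invented for the example.

```python
# Hypothetical rows, as they would look after splitting tab-delimited lines.
patients = [
    ['P001', 'F', '72'],
    ['P002', 'M', '45'],
    ['P003', 'F', '60'],
]

def isOldEnough(row):
    """True when the age in the 3rd column is 60 or more."""
    return int(row[2]) >= 60

# filter keeps the rows for which the function returns True.
viaFilter = list(filter(isOldEnough, patients))

# A list comprehension expresses the same selection inline.
viaComprehension = [row for row in patients if int(row[2]) >= 60]
```

Both approaches select the same rows; which one reads better is largely a matter of taste.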
March 7, 2008
Computing Chi-Squared P-Value from Contingency Table in Python
Update: Here is a link to notes from my Stats class that gives some background (starting on page 5): http://episun7.med.utah.edu/~alun/teach/stats/week05.pdf
To do this you need to have SciPy installed. Below is one way to do it. I’m sure there’s a more efficient way to do it. But this is working for me. Any feedback is welcome.
import scipy.stats

def computeContingencyTablePValue(*observedTuples):
    # Each argument is one row of observed counts; all rows must have the same length.
    if len(observedTuples) == 0: return None
    for row in observedTuples:
        if len(row) != len(observedTuples[0]): return None

    rowSums = []
    for row in observedTuples:
        rowSums.append(float(sum(row)))

    columnSums = []
    for i in range(len(observedTuples[0])):
        columnSum = 0.0
        for row in observedTuples:
            columnSum += row[i]
        columnSums.append(float(columnSum))

    grandTotal = float(sum(rowSums))
    observedTestStatistic = 0.0

    for i in range(len(observedTuples)):
        for j in range(len(columnSums)):
            expectedValue = (rowSums[i]/grandTotal)*(columnSums[j]/grandTotal)*grandTotal
            observedValue = float(observedTuples[i][j])
            observedTestStatistic += ((observedValue - expectedValue)**2) / expectedValue

    degreesFreedom = (len(columnSums) - 1) * (len(rowSums) - 1)

    # chi2.sf is the upper-tail probability; it replaces chisqprob, which newer SciPy releases removed.
    return scipy.stats.chi2.sf(observedTestStatistic, degreesFreedom)
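To see the arithmetic on a small table without needing SciPy, here is a sketch that computes only the test statistic and degrees of freedom in plain Python. The function name chiSquaredStatistic and the 2x2 counts are invented for illustration. For the special case of exactly 1 degree of freedom, the p-value has a closed form using math.erfc, which is what the last line relies on; for other degrees of freedom you would still need something like SciPy.

```python
import math

def chiSquaredStatistic(*observedRows):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    rowSums = [float(sum(row)) for row in observedRows]
    columnSums = [float(sum(column)) for column in zip(*observedRows)]
    grandTotal = sum(rowSums)
    statistic = 0.0
    for i, row in enumerate(observedRows):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            expected = rowSums[i] * columnSums[j] / grandTotal
            statistic += (observed - expected) ** 2 / expected
    degreesFreedom = (len(rowSums) - 1) * (len(columnSums) - 1)
    return statistic, degreesFreedom

# Made-up 2x2 table: expected counts work out to 12, 18, 28 and 42.
statistic, degreesFreedom = chiSquaredStatistic((10, 20), (30, 40))

# Valid only for 1 degree of freedom: P(chi-squared > s) = erfc(sqrt(s / 2)).
pValue = math.erfc(math.sqrt(statistic / 2.0))
```

For this table the statistic is 50/63, or about 0.79, which is far too small to suggest any association between rows and columns.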
February 13, 2008
Python GMail SMTP Example
I needed to send an email from my Python script, and I wanted to use GMail as the outgoing SMTP server. It becomes a little tricky because the GMail servers require authentication. I searched around, found some good examples on the Internet, and then fine-tuned them a bit.
import os
import smtplib
import mimetypes
import email
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email.mime.audio import MIMEAudio
from email.mime.image import MIMEImage
from email.encoders import encode_base64

def sendMail(subject, text, *attachmentFilePaths):
    # Placeholder credentials and recipient; substitute your own.
    gmailUser = 'yo.mama@gmail.com'
    gmailPassword = 'bogus!'
    recipient = 'test@test.com'

    msg = MIMEMultipart()
    msg['From'] = gmailUser
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.attach(MIMEText(text))

    for attachmentFilePath in attachmentFilePaths:
        msg.attach(getAttachment(attachmentFilePath))

    # Port 587 with STARTTLS: greet, upgrade to TLS, greet again, then log in.
    mailServer = smtplib.SMTP('smtp.gmail.com', 587)
    mailServer.ehlo()
    mailServer.starttls()
    mailServer.ehlo()
    mailServer.login(gmailUser, gmailPassword)
    mailServer.sendmail(gmailUser, recipient, msg.as_string())
    mailServer.close()

    print('Sent email to %s' % recipient)

def getAttachment(attachmentFilePath):
    contentType, encoding = mimetypes.guess_type(attachmentFilePath)

    if contentType is None or encoding is not None:
        # Unknown or compressed file: fall back to a generic binary type.
        contentType = 'application/octet-stream'

    mainType, subType = contentType.split('/', 1)
    file = open(attachmentFilePath, 'rb')

    if mainType == 'text':
        # MIMEText expects a string, so decode the bytes (UTF-8 is assumed here).
        attachment = MIMEText(file.read().decode('utf-8', 'replace'), _subtype=subType)
    elif mainType == 'message':
        attachment = email.message_from_binary_file(file)
    elif mainType == 'image':
        attachment = MIMEImage(file.read(), _subtype=subType)
    elif mainType == 'audio':
        attachment = MIMEAudio(file.read(), _subtype=subType)
    else:
        attachment = MIMEBase(mainType, subType)
        attachment.set_payload(file.read())
        encode_base64(attachment)

    file.close()

    attachment.add_header('Content-Disposition', 'attachment', filename=os.path.basename(attachmentFilePath))
    return attachment
Derived from: http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html and http://mail.python.org/pipermail/python-list/2003-September/225540.html
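When experimenting with this, it can be handy to build the message and look at it before involving any mail server. The sketch below only assembles a multipart message and prints it; the function name buildMessage and the addresses are placeholders I made up, and nothing is sent.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def buildMessage(sender, recipient, subject, text):
    """Assemble a multipart message without connecting to any server."""
    msg = MIMEMultipart()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    # The body text becomes the first part of the multipart container.
    msg.attach(MIMEText(text))
    return msg

# Placeholder addresses for illustration only.
msg = buildMessage('sender@example.com', 'test@test.com', 'Hello', 'Just testing.')
print(msg.as_string())
```

The printed text is what would be handed to sendmail, so it is a quick way to check the headers and parts before you try logging in to the server.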
January 4, 2008