Posts filed under 'Python'

Peeking at Large Files in Python

I have been parsing files in the multi-gigabyte range. Python handles them pretty well, but it can still take a while to chug through them. To be honest, I don’t know of any great tricks to speed this up. However, one thing that helps when parsing a large file is reading just a few lines to see the format. The following code copies the first 100 lines of a text file into a new file, so you can inspect the format without reading through the whole thing. To copy the entire file, take out the if/else and keep only the write call.

inFileName = "associations.txt"
inFile = open(inFileName, 'r')
outFile = open("peek_%s" % inFileName, 'w')

count = 0

for line in inFile:
  count += 1

  # Copy lines until the limit is reached, then stop reading
  if count <= 100:
    outFile.write(line)
  else:
    break

outFile.close()
inFile.close()
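
The same peek can also be written with itertools.islice and with blocks, which close both files even if an error occurs. This is only a sketch: the function name peek and its max_lines parameter are made up for illustration.

```python
from itertools import islice

def peek(in_path, out_path, max_lines=100):
    """Copy at most max_lines lines from in_path to out_path."""
    with open(in_path, "r") as in_file, open(out_path, "w") as out_file:
        # islice stops after max_lines lines, so the rest of the file is never read
        out_file.writelines(islice(in_file, max_lines))
```

Calling peek("associations.txt", "peek_associations.txt") would mirror the loop above.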

July 14, 2008

Find Files in Directory Using Python

A nice solution to this is the path Python module. However, the following simple solution will do the trick. It doesn’t support wildcards at this point, but that could easily be added with some regular expression code.

import os

def getFilesMatchingPattern(directory, nonWildCardPattern):
  fileList = os.listdir(directory)
  # Keep only the names that contain the pattern as a plain substring
  return [f for f in fileList if f.find(nonWildCardPattern) > -1]
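
For the wildcard support mentioned above, the standard fnmatch module may be simpler than hand-written regular expressions. A minimal sketch, where the function name get_files_matching_wildcard is hypothetical:

```python
import fnmatch
import os

def get_files_matching_wildcard(directory, pattern):
    """Return the sorted names in directory that match a shell-style pattern."""
    # fnmatch.filter understands *, ? and [seq] wildcards
    return sorted(fnmatch.filter(os.listdir(directory), pattern))
```

For example, get_files_matching_wildcard(".", "*.txt") would list the text files in the current directory.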

1 comment July 10, 2008

Converting a String to a Boolean in Python

Let’s say you have a string value that you want to convert to a boolean, but you’re not sure the format it will be in. Some languages have built-in functions for doing this, but to my knowledge Python doesn’t. Here’s a way to do it (though it’s not comprehensive). (Thanks to the commenter who helped me see a simpler way to do this.)

def parseBoolString(theString):
  return theString[0].upper() == 'T'

parseBoolString("true")

True

parseBoolString("false")

False
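
Because the function above only looks at the first character, anything that does not start with "t" counts as false. A stricter variant could check against explicit sets of accepted spellings and reject everything else. The sets and the function name below are assumptions for illustration, not a standard:

```python
# Hypothetical lists of accepted spellings; adjust to the data at hand
TRUE_STRINGS = {"true", "t", "yes", "y", "1", "on"}
FALSE_STRINGS = {"false", "f", "no", "n", "0", "off"}

def parse_bool_string_strict(the_string):
    """Convert a string to a boolean, raising ValueError if it is unrecognized."""
    normalized = the_string.strip().lower()
    if normalized in TRUE_STRINGS:
        return True
    if normalized in FALSE_STRINGS:
        return False
    raise ValueError("Cannot interpret %r as a boolean" % the_string)
```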

6 comments April 8, 2008

Simple Method to Search a Python List

Let’s say you have a list of objects of type Individual and that list is called individuals.

The Individual type contains an ID, name, and email address.

Let’s say you have an ID and want to get the corresponding Individual object from the list. How would you go about doing that?

match = [ind for ind in individuals if ind.id == theID]
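
Note that the list comprehension returns a list (empty if nothing matches), not a single object. If exactly one object is wanted, one option is next with a generator expression, which stops at the first match. In this sketch the Individual class and the sample data are made up to match the description above:

```python
class Individual(object):
    """Hypothetical stand-in for the Individual type described above."""
    def __init__(self, id, name, email):
        self.id = id
        self.name = name
        self.email = email

def find_individual(individuals, the_id):
    """Return the first Individual whose id matches, or None if there is none."""
    return next((ind for ind in individuals if ind.id == the_id), None)

individuals = [
    Individual(1, "Ann", "ann@example.com"),
    Individual(2, "Bob", "bob@example.com"),
]
match = find_individual(individuals, 2)
```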

3 comments April 4, 2008

Sorting Dictionaries in Python

Python 2.4 and later have a built-in function called sorted that can help you sort dictionaries. Below is the basic functionality.

Sort by key:

sorted(x.items())

A wrapper class that iterates over a dictionary’s items in key order:

class SortedDictionary:
  def __init__(self, dictToSort):
    self.keys = sorted(dictToSort.keys())
    self.values = [dictToSort[key] for key in self.keys]
    self._lastIndex = -1

  def __iter__(self):
    return self

  def __next__(self):
    if self._lastIndex < (len(self.keys) - 1):
      self._lastIndex += 1
      return (self.keys[self._lastIndex], self.values[self._lastIndex])
    else:
      raise StopIteration

  next = __next__  # Python 2 calls next(); Python 3 calls __next__()

x = {}
x['abc'] = 1
x['aaa'] = 2

y = SortedDictionary(x)
print(y.keys)
print(y.values)

for z in y:
  print(z)
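
The class above orders items by key. For sorting by value, one common approach is to pass a key function to sorted. A minimal sketch with made-up sample data:

```python
x = {"abc": 1, "aaa": 2, "abb": 0}

# Tuples compare element by element, so this orders by key
by_key = sorted(x.items())

# The key function picks out the value (second element) of each pair
by_value = sorted(x.items(), key=lambda item: item[1])

print(by_key)
print(by_value)
```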

March 28, 2008

Append a List to a List in Python

(Note: Please see my latest posts at my new blog!)

An easy way to do this is with the list’s extend method:

x = [1,2,3]

x.extend([4,5])

[1,2,3,4,5]
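
It can help to compare extend with two similar operations. The sketch below shows that extend changes the list in place, append adds its argument as a single nested element, and + builds a new list:

```python
x = [1, 2, 3]
x.extend([4, 5])      # modifies x in place, adding each element

y = [1, 2, 3]
y.append([4, 5])      # adds the whole list as one element

a = [1, 2, 3]
z = a + [4, 5]        # creates a new list and leaves a unchanged

print(x, y, z, a)
```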

1 comment March 19, 2008

Simple Method to Calculate Median in Python

def getMedian(numericValues):
  theValues = sorted(numericValues)

  if len(theValues) % 2 == 1:
    # Odd count: the middle element is the median
    return theValues[(len(theValues)+1)//2-1]
  else:
    # Even count: average the two middle elements
    lower = theValues[len(theValues)//2-1]
    upper = theValues[len(theValues)//2]

    return (float(lower + upper)) / 2

def validate(valueShouldBe, valueIs):
  print("Value Should Be: %.6f, Value Is: %.6f, Correct: %s" % (valueShouldBe, valueIs, valueShouldBe==valueIs))

validate(2.5, getMedian([0,1,2,3,4,5]))
validate(2, getMedian([0,1,2,3,4]))
validate(2, getMedian([3,1,2]))
validate(3, getMedian([3,2,3]))
validate(1.234, getMedian([1.234, 3.678, -2.467]))
validate(1.345, getMedian([1.234, 3.678, 1.456, -2.467]))
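
On Python 3.4 and later, the standard library's statistics module offers a median function that can serve as a cross-check for getMedian. A minimal sketch reusing two of the inputs above:

```python
import statistics

# Even number of values: the two middle values (2 and 3) are averaged
even_count = statistics.median([0, 1, 2, 3, 4, 5])

# Odd number of values: the middle value is returned as-is
odd_count = statistics.median([3, 1, 2])

print(even_count, odd_count)
```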

6 comments March 17, 2008

Filtering Data in Python (Example of Functional Programming Approach)

Let’s say you are doing a cancer study and have a list of patients of various ages in a tab-delimited file. You want to limit the study to patients who are 60 years or older. One way you could do this is use a for loop and process the data one row at a time and remove any patients below your threshold. Another way is to insert the data into a SQL database and use a WHERE clause to filter the data and then extract it back out. (That’s pretty desperate, but I know it happens!)

One simple way to do this in Python is to use the filter function. Let’s say you pull the data from the file into a list of rows, where each row is a list of column values.

patientFile = open("Patients.csv", 'r')
patients = [line.rstrip().split('\t') for line in patientFile]
patientFile.close()

Now suppose age is the 3rd column in the data. You need to create a small function to determine whether a row meets the criteria:

def f(x): return int(x[2]) >= 60

Then you use the filter function and apply that function to the data.

matches = list(filter(f, patients))  # list() is needed on Python 3, where filter returns an iterator

This is a contrived example, but I hope it shows how you might start using functional programming and gives you a flavor of how powerful the approach can be.
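
To try the idea without a file on disk, here is a self-contained sketch. The rows are invented sample data (ID, name, age), and the list comprehension at the end is an equivalent alternative to filter:

```python
# Made-up tab-delimited rows standing in for the patient file
rows = ["P001\tSmith\t72", "P002\tJones\t45", "P003\tLee\t60"]
patients = [line.rstrip().split("\t") for line in rows]

def is_at_least_60(patient):
    """Age is assumed to be the 3rd column, as in the text above."""
    return int(patient[2]) >= 60

matches = list(filter(is_at_least_60, patients))

# The same filter written as a list comprehension
same_matches = [p for p in patients if int(p[2]) >= 60]
```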

March 7, 2008

Computing Chi-Squared P-Value from Contingency Table in Python

Update: Here is a link to notes from my Stats class that give some background (starting on page 5): http://episun7.med.utah.edu/~alun/teach/stats/week05.pdf

To do this you need to have SciPy installed. Below is one way to do it. I’m sure there’s a more efficient way to do it. But this is working for me. Any feedback is welcome.

import scipy.stats

def computeContingencyTablePValue(*observedTuples):
  if len(observedTuples) == 0: return None

  # Every row must have the same number of columns
  for row in observedTuples:
    if len(row) != len(observedTuples[0]): return None

  rowSums = []
  for row in observedTuples:
    rowSums.append(float(sum(row)))

  columnSums = []
  for i in range(len(observedTuples[0])):
    columnSum = 0.0
    for row in observedTuples:
      columnSum += row[i]
    columnSums.append(float(columnSum))

  grandTotal = float(sum(rowSums))

  # Sum (observed - expected)^2 / expected over every cell
  observedTestStatistic = 0.0
  for i in range(len(observedTuples)):
    for j in range(len(columnSums)):
      expectedValue = (rowSums[i]/grandTotal)*(columnSums[j]/grandTotal)*grandTotal
      observedValue = float(observedTuples[i][j])
      observedTestStatistic += ((observedValue - expectedValue)**2) / expectedValue

  degreesFreedom = (len(columnSums) - 1) * (len(rowSums) - 1)
  # chisqprob has been removed from newer SciPy releases; chi2.sf gives the same upper-tail probability
  return scipy.stats.chi2.sf(observedTestStatistic, degreesFreedom)
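
To see the arithmetic without SciPy, here is a small worked sketch for the 2x2 case only. It relies on the fact that with 1 degree of freedom the chi-squared upper-tail probability equals erfc(sqrt(x / 2)); that shortcut does not apply to larger tables. The function name and the table values are made up for illustration:

```python
import math

def chi_squared_2x2(table):
    """Return (test statistic, p-value) for a 2x2 contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = float(sum(row_sums))

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed - expected) ** 2 / expected

    # Valid only for 1 degree of freedom, which is what a 2x2 table has
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p_value = chi_squared_2x2([[10, 20], [30, 40]])
print(stat, p_value)
```

For this table the expected counts are 12, 18, 28 and 42, so the statistic is 4/12 + 4/18 + 4/28 + 4/42 = 50/63, roughly 0.79.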

February 13, 2008

Python GMail SMTP Example

I needed to send an email from my Python script, and I wanted to use my GMail account as the outgoing SMTP server. It is a little tricky because the GMail servers require authentication. I searched around, found some good examples on the Internet, and then fine-tuned them a bit.

import os
import email
import smtplib
import mimetypes
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email.mime.audio import MIMEAudio
from email.mime.image import MIMEImage
from email.encoders import encode_base64

def sendMail(subject, text, *attachmentFilePaths):
  gmailUser = 'yo.mama@gmail.com'
  gmailPassword = 'bogus!'
  recipient = 'test@test.com'

  msg = MIMEMultipart()
  msg['From'] = gmailUser
  msg['To'] = recipient
  msg['Subject'] = subject
  msg.attach(MIMEText(text))

  for attachmentFilePath in attachmentFilePaths:
    msg.attach(getAttachment(attachmentFilePath))

  # Connect on the submission port, switch to TLS, then authenticate
  mailServer = smtplib.SMTP('smtp.gmail.com', 587)
  mailServer.ehlo()
  mailServer.starttls()
  mailServer.ehlo()
  mailServer.login(gmailUser, gmailPassword)
  mailServer.sendmail(gmailUser, recipient, msg.as_string())
  mailServer.close()

  print('Sent email to %s' % recipient)

def getAttachment(attachmentFilePath):
  contentType, encoding = mimetypes.guess_type(attachmentFilePath)

  # Fall back to a generic binary type when the type is unknown or the file is compressed
  if contentType is None or encoding is not None:
    contentType = 'application/octet-stream'

  mainType, subType = contentType.split('/', 1)
  file = open(attachmentFilePath, 'rb')

  if mainType == 'text':
    # Assumes the text file is UTF-8 encoded
    attachment = MIMEText(file.read().decode('utf-8'))
  elif mainType == 'message':
    # message_from_binary_file requires Python 3.2 or later
    attachment = email.message_from_binary_file(file)
  elif mainType == 'image':
    attachment = MIMEImage(file.read(), _subtype=subType)
  elif mainType == 'audio':
    attachment = MIMEAudio(file.read(), _subtype=subType)
  else:
    attachment = MIMEBase(mainType, subType)
    attachment.set_payload(file.read())
    encode_base64(attachment)

  file.close()

  attachment.add_header('Content-Disposition', 'attachment', filename=os.path.basename(attachmentFilePath))
  return attachment

Derived from: http://kutuma.blogspot.com/2007/08/sending-emails-via-gmail-with-python.html and http://mail.python.org/pipermail/python-list/2003-September/225540.html
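
On Python 3.6 and later, the standard library's email.message.EmailMessage class can build the same kind of message with less code. The sketch below only constructs the message and does not send it; the helper name build_message, the addresses, and the attachment bytes are placeholders:

```python
from email.message import EmailMessage

def build_message(sender, recipient, subject, text, attachment_name, attachment_bytes):
    """Build a message with a plain-text body and one binary attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)
    # Attach raw bytes as a generic binary file; the payload is base64-encoded automatically
    msg.add_attachment(attachment_bytes, maintype="application",
                       subtype="octet-stream", filename=attachment_name)
    return msg

msg = build_message("sender@example.com", "recipient@example.com",
                    "Report", "See attached.", "data.bin", b"\x00\x01\x02")
attachment_names = [part.get_filename() for part in msg.iter_attachments()]
print(attachment_names)
```

A message built this way could then be passed to an authenticated smtplib.SMTP connection's send_message method in place of the sendmail call above.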

11 comments January 4, 2008

