Python

Steps while creating a new python project

Whenever you create a new Python project, make sure it has the following components.

  • README.rst - Describes the project name, purpose and installation procedure, lists relevant publications, and acknowledges contributors.

  • LICENSE

    • It allows others to reuse your code in a hassle-free way. Choose a license.
    • The MIT License is among the simplest and most permissive. Other options are GNU GPL v3 or Apache License Version 2.0.
    • It is a good idea to keep a copy of these license files in the ~/.vim/ folder and define custom mappings for them.
    • For example, a Vim mapping such as map :mit :0r path_to_mit_license_file inserts the MIT license text at the top of the current file.
  • setup.py - Useful for building, packaging and distributing your code.

    • pip install --upgrade setuptools
    • Follow setuptools documentation for creating this file.
    • Look at the sample project’s setup.py.
  • requirements.txt - Describes the exact dependencies required by your project.

    • This file can be generated automatically in several ways.
    • First method: use virtualenv to create a new virtual environment and switch to it. Do fresh pip installs, then run pip freeze > requirements.txt.
    • Second method: use pipenv. Create a new virtual environment with pipenv install, activate it with pipenv shell, and install packages with pipenv install package_name. pipenv records each installed package in a file called Pipfile.
    • Third (recommended) method: install pipreqs and execute pipreqs /home/project/location. It creates a requirements.txt in the specified location.
  • your_package_folder/__init__.py - An empty file which tells Python to treat the folder as a package.

  • your_package_folder/your_modules.py - python modules in your package.

  • docs/conf.py - docs is the documentation folder, and conf.py is the configuration file for the Sphinx documentation builder.

  • docs/index.rst - The index file which contains the reference for other document files.

    • Create a folder docs in the repository root. Execute sphinx-quickstart.

    • Follow steps given here

    • Edit the docs/conf.py so that it contains the following lines (along with other lines).

      import os
      import sys
      sys.path.insert(0, os.path.abspath('..'))
      autodoc_member_order = 'groupwise'
      extensions = ['sphinx.ext.autodoc']
      
    • Add modules to index.rst

    • Run the command sphinx-apidoc -o your_project_docs_folder_path your_project_path

  • tests/ - This folder contains separate testing code.
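
The component list above can be turned into a small scaffolding script. This is only a sketch under stated assumptions: the helper name create_skeleton and the package name passed to it are hypothetical, every file is created empty, and docs/ is left empty so that sphinx-quickstart can generate conf.py and index.rst itself.

```python
import tempfile
from pathlib import Path

# Components named in the list above. The package folder is filled in later.
FILES = [
    "README.rst",
    "LICENSE",
    "setup.py",
    "requirements.txt",
    "{package}/__init__.py",
]
# docs/ stays empty here; sphinx-quickstart is expected to populate it.
DIRECTORIES = ["docs", "tests"]


def create_skeleton(root, package):
    """Create empty placeholders for the project components; return their paths."""
    root = Path(root)
    created = []
    for name in DIRECTORIES:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    for template in FILES:
        path = root / template.format(package=package)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch(exist_ok=True)  # leaves the content of existing files alone
        created.append(path)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is modified.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_skeleton(tmp, "your_package_folder"):
            print(path.relative_to(tmp))
```

The placeholders still need real content: the README text, a license, and a setup.py written by following the setuptools documentation.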

Documentation

Use IPython inside a Python Program

To inspect variables in a Python script (one that takes a long time to run), you can insert the following lines in your code to drop into an embedded IPython shell at that point.

    from IPython import embed
    embed()  # Place this line somewhere in your program

Anaconda or Pip

  • Always use the Python provided by Anaconda (do not use the default Python provided by Ubuntu). The difference between pip and conda is given here
  • Always try to install packages using conda.
    • In Anaconda, you can create multiple environments. The Python version and package installations in those environments are isolated from each other.
    • Environment creation - conda create --name snakes python=3
      • Environment activation - source activate snakes (conda 4.4 and later also accept conda activate snakes)
      • Environment deactivation - source deactivate (no environment name is needed)
      • I have 3 environments on my Mac:
        • One for the default (Python 2)
        • One for Python 3
        • One for Magenta
      • To install a specific python version in an environment use conda install python=2.6
      • You can also search for various versions of a package using conda search packageName
      • You can list the existing conda environments using conda env list
  • To install packages through pip, use pip install package_name.
  • To upgrade installed packages, use pip install package_name --upgrade.
  • I recently encountered an issue after upgrading to Python 3.6 in Miniconda: pip installs failed with an error saying that the ssl module is not available. I fixed it using the following commands.
    • source activate snakes (my Python 3.6.2 virtual environment)
    • conda update openssl
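
When several environments are installed, it helps to confirm which interpreter is actually running and whether its ssl module imports, since a missing ssl module is what broke pip in the note above. The following is a minimal sketch using only the standard library; the function name interpreter_report is a hypothetical helper, not part of conda or pip.

```python
import sys


def interpreter_report():
    """Describe the running interpreter and whether its ssl module imports."""
    try:
        import ssl
        openssl = ssl.OPENSSL_VERSION  # version string of the linked OpenSSL
    except ImportError:
        openssl = None  # pip cannot reach HTTPS indexes in this state
    return {
        "executable": sys.executable,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        "ssl_available": openssl is not None,
        "openssl": openssl,
    }


if __name__ == "__main__":
    for key, value in interpreter_report().items():
        print(f"{key}: {value}")
```

Running this inside an activated environment should show an executable path under that environment's folder rather than the system Python.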

Multiprocessing

  • Multiprocessing is a highly convenient option for parallel processing in Python, for example to take a list of strings as input and modify the strings in parallel.

  • There are scenarios where data needs to be shared between multiple processes (e.g. incrementing or decrementing global variables).

    • If it is a counter, always try to pass counter values as additional inputs rather than sharing them between processes.

    • Note: global variables are not shared between processes. Shared state needs special objects such as Value, Manager dictionaries and Lock.

      from multiprocessing import Pool, Value, Manager, Lock
      counter = Value('i', 0) # Globally accessible, defined in __main__ function. 'i' represents integer
      
      # Dictionary Initializations
      manager = Manager()
      word_dict = manager.dict()
      lemma_dict = manager.dict()
      pos_dict = manager.dict()
      
      # Locks Initializations
      l1 = Lock()
      l2 = Lock()
      
      # In the function which is going to be called by multiple processes.
      # "with" releases the lock even if the body raises an exception.
      with l1:
          counter.value += 1
      
      with l2:
          word_dict[word] = len(word_dict) + 1
          lemma_dict[lemma] = len(lemma_dict) + 1
          pos_dict[word] = len(pos_dict) + 1
      
    • Note: the manager.dict() objects are proxies, not real dictionaries. You cannot dump them as simple pickle objects and expect them to work like normal Python dictionaries when you pickle-load them again. Therefore, convert them into normal Python dictionaries first (for example with dict(word_dict)) and then pickle-dump those.

  • I have personally encountered some issues while using multiprocessing with NLTK on my Mac. However, the same code with the same NLTK version runs on Ubuntu. Many others have reported similar problems (incompatibility of NLTK and multiprocessing).
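
As an illustration of the points above, here is a minimal sketch that modifies a list of strings in parallel with multiprocessing.Pool. It is not the original sample script: the function names tag_string and modify_in_parallel are hypothetical, and the "modification" (strip and upper-case) is only a stand-in. The counter is passed to each call as an extra argument through enumerate, so nothing has to be shared between processes.

```python
from multiprocessing import Pool


def tag_string(index, text):
    """Worker run in a child process; the counter value arrives as an argument."""
    return f"{index}:{text.strip().upper()}"


def modify_in_parallel(strings, processes=2):
    """Apply tag_string to every item using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # starmap unpacks each (index, text) pair into the worker's arguments
        # and returns the results in input order.
        return pool.starmap(tag_string, enumerate(strings))


if __name__ == "__main__":
    # The guard matters on platforms that start workers by re-importing this file.
    print(modify_in_parallel([" apple", "banana ", "cherry"]))
```

The worker has to be a module-level function because it is sent to the child processes by name.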

Numpy

  • You can check whether a NumPy array contains NaN or Inf values. Such arrays usually cause problems in later computations.

    import numpy as np
    aa = np.array([1, 3, 4])
    bb = np.array([1, 0, 0])
    cc = aa / bb  # Emits a "divide by zero" RuntimeWarning (not an exception); cc is [1., inf, inf]
    np.isnan(cc).any()  # Checks for NaN values in the array (False here).
    np.isinf(cc).any()  # Checks for Inf values in the array (True here).
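
Both checks can be combined with np.isfinite, which is False for NaN as well as for positive and negative Inf. The sketch below uses made-up input values, and replacing the bad entries with 0.0 is only one possible policy, chosen here as an assumption for illustration.

```python
import numpy as np

aa = np.array([1.0, 3.0, 0.0])
bb = np.array([1.0, 0.0, 0.0])

# Silence the divide-by-zero and invalid-value warnings for this block only.
with np.errstate(divide="ignore", invalid="ignore"):
    cc = aa / bb  # 1/1 is finite, 3/0 gives inf, 0/0 gives nan

finite_mask = np.isfinite(cc)             # False wherever cc is NaN or +/-Inf
has_bad_values = not finite_mask.all()    # True if anything needs attention
cleaned = np.where(finite_mask, cc, 0.0)  # assumed policy: replace with 0.0

if __name__ == "__main__":
    print(has_bad_values, cleaned)
```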
    

Python HTTP requests

  • I have used an HTTP POST request to query DBpedia Spotlight:

    import requests
    
    headers = {'Accept': 'application/json'}
    url = 'http://localhost:2222/rest/disambiguate'
    # Adjacent string literals are concatenated, so the XML stays one logical
    # line and the surfaceForm offsets still match the annotation text.
    data = {"text": (
        '<annotation text="Keep us posted, Carlleton. Similar '
        'problem here. I managed to get my D up after 70 months of high dose '
        'supplement, but after two years have now dropped Back into the land of '
        'Osteomalacia"> <surfaceForm name="Back" offset="152">'
        '</surfaceForm><surfaceForm name="Osteomalacia" '
        'offset="174"></surfaceForm></annotation>'
    )}
    r = requests.post(url, data=data, headers=headers)
    print(r.text)
    
  • Note: for GET requests, use the requests.get function. When calling it, make sure to change the header key to Content-Type instead of Accept.

Sacred

  • Sacred is a useful Python tool for parameter-sweeping experiments.
  • pip install sacred
  • It stores all the information about an experiment run in MongoDB. For that, you need to set up MongoDB on your system and have pymongo installed. More help is available here

Other Packages

  • One of the useful aspects of Python is pickle. I pickled a huge word-vectors file, and loading it back took less than 10 seconds.

  • A great feature of Python is the ability to pickle objects. However, you cannot pickle lambda functions or objects that depend on a lambda function. The reason is that functions are pickled by name, not by code, so unpickling only works if a function with the same name is present in the same module. This is why pickling a lambda won’t work: lambdas have no individual names.

  • One useful package for printing Python output in multiple colors is termcolor. Install it with conda install -c omnia termcolor.

    from termcolor import colored
    print(colored('Hello', 'green'))
    
  • IPython notebooks have a handy extension called storemagic (the %store magic) that persists picklable Python objects between sessions.
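
The lambda limitation described in the pickle note above can be seen in a few lines. This is a small sketch with hypothetical names (double, triple, can_pickle): a named module-level function pickles because only its name is stored, while a lambda fails because its name, "<lambda>", cannot be looked up again.

```python
import pickle


def double(x):
    """A named, module-level function: pickle stores only a reference to it."""
    return 2 * x


triple = lambda x: 3 * x  # its __name__ is "<lambda>", so lookup by name fails


def can_pickle(obj):
    """Return True if pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("named function:", can_pickle(double))
    print("lambda:", can_pickle(triple))
    restored = pickle.loads(pickle.dumps(double))
    print("restored(21) =", restored(21))
```

If a lambda is needed inside an object that must be pickled, replacing it with a named function defined at module level is usually the simplest fix.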