PETRARCH Package

petrarch2 Module

petrarch2.check_discards(SentenceText)

Checks whether any of the discard phrases are in SentenceText, giving priority to the + matches. Returns [indic, match], where indic is:

0 : no matches
1 : simple match
2 : story match [+ prefix]
petrarch2.close_tex(fname)
petrarch2.do_coding(event_dict)

Main coding loop. Note that entering any character other than ‘Enter’ at the prompt will stop the program: this is deliberate.

<14.02.28>: Bug: PETRglobals.PauseByStory actually pauses after the first sentence of the next story.
petrarch2.get_issues(SentenceText)

Finds the issues in SentenceText, returns as a list of [code,count]

<14.02.28> stops coding and sets the issues to zero if it finds any ignore phrase

petrarch2.get_version()
petrarch2.main()
petrarch2.open_tex(filename)
petrarch2.parse_cli_args()

Function to parse the command-line arguments for PETRARCH2.

petrarch2.read_dictionaries(validation=False)
petrarch2.run(filepaths, out_file, s_parsed)
petrarch2.run_pipeline(data, out_file=None, config=None, write_output=True, parsed=False)

PETRglobals Module

PETRreader Module

exception PETRreader.DateError

Bases: exceptions.Exception

PETRreader.check_attribute(targattr)

Looks for targattr in AttributeList; returns value if found, null string otherwise.

PETRreader.close_FIN()
PETRreader.dstr_to_ordate(datestring)

Computes an ordinal date from a Gregorian calendar date string YYYYMMDD or YYMMDD.
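
The conversion can be sketched as follows. This is a minimal illustration, not the PETRreader implementation: it assumes "ordinal date" means the proleptic Gregorian day count returned by datetime.date.toordinal(), it expands two-digit years with a hypothetical pivot parameter, and it raises ValueError rather than PETRreader.DateError.

```python
from datetime import date

def dstr_to_ordate_sketch(datestring, pivot=30):
    """Convert 'YYYYMMDD' or 'YYMMDD' to a proleptic Gregorian ordinal."""
    if len(datestring) == 8:
        year = int(datestring[:4])
        rest = datestring[4:]
    elif len(datestring) == 6:
        yy = int(datestring[:2])
        # Assumed rule (not from the source): 00..pivot -> 2000s, else 1900s.
        year = 2000 + yy if yy <= pivot else 1900 + yy
        rest = datestring[2:]
    else:
        raise ValueError("expected YYYYMMDD or YYMMDD: %r" % datestring)
    month, day = int(rest[:2]), int(rest[2:])
    # date() itself raises ValueError for impossible dates such as 20080231.
    return date(year, month, day).toordinal()
```

Because ordinals are plain integers, date restrictions can then be checked with ordinary comparisons.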

PETRreader.extract_attributes(theline)

Structure of attributes extracted to AttributeList. At present, these always require a quoted field which follows an ‘=’, though it probably makes sense to make that optional and allow attributes without content.

PETRreader.find_tag(tagstr)
PETRreader.get_attribute(targattr)

Similar to check_attribute() except it raises a MissingAttr error when the attribute is missing.

PETRreader.make_noun_list(nounst)
PETRreader.make_plural_noun(noun)

Create the plural of a synonym noun st

PETRreader.open_FIN(filename, descrstr)
PETRreader.parse_Config(config_path)

Parse PETRglobals.ConfigFileName. The default is PETR_config.ini in the working directory, but this can be changed using the -c option on the command line. Most of the entries are obvious (but will eventually be documented), with the following exceptions:

  1. actorfile_list and textfile_list are comma-delimited lists. Per the usual rules for Python config files, these can be continued on the next line provided the first char is a space or tab.

  2. If both textfile_list and textfile_name are present, textfile_list takes priority. textfile_list should be the name of a file containing text file names; # is allowed as a comment delimiter at the beginning of individual lines and following the file name.

  3. For additional info on config files, see

    http://docs.python.org/3.4/library/configparser.html

    or try Google, but basically, it is fairly simple, and you can probably just follow the examples.
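
The continuation rule in item 1 can be illustrated with the standard configparser module. This is a sketch only: the section name and the two file names are assumptions made for the example, and parse_Config() itself may read the file differently.

```python
import configparser

# Hypothetical config fragment; the section and file names are illustrative.
SAMPLE = """\
[Dictionaries]
actorfile_list = Example.Countries.actors.txt,
    Example.International.actors.txt
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# The indented second line is treated as a continuation of the same value.
raw = parser.get("Dictionaries", "actorfile_list")

# Split the comma-delimited list and drop surrounding whitespace/newlines.
actor_files = [name.strip() for name in raw.split(",") if name.strip()]
```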

PETRreader.read_FIN_line()

Reads a line from the input stream fin, deleting XML comments and lines beginning with #. Returns the next non-empty line or EOF. Tracks the current line number (FINnline) and content (FINline). The calling function needs to handle EOF (len(line) == 0).

PETRreader.read_actor_dictionary(actorfile)

This is a simple dictionary of dictionaries indexed on the words in the actor string. The final node has the key ‘#’ and contains codes with their date restrictions and, optionally, the root phrase in the case of synonyms.

Example:

UFFE_ELLEMANN_JENSEN_ [IGOEUREEC 820701-821231][IGOEUREEC 870701-871231] # president of the CoEU from DENMARK# IGOrulers.txt

the actor above is stored as:

{u'UFFE': {u'ELLEMANN': {u'JENSEN': {u'#': [(u'IGOEUREEC', [u'820701', u'821231']), (u'IGOEUREEC', [u'870701', u'871231'])]}}}}
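
The nesting shown above can be reproduced with a few lines of Python. This is a minimal sketch of the data structure only; the function names are hypothetical, and parsing of the dictionary file (codes, date ranges, comments) is assumed to have happened already.

```python
def add_actor(tree, phrase, codes):
    """Insert an underscore-delimited actor phrase into a nested dict."""
    node = tree
    for word in phrase.strip("_").split("_"):
        # Descend one level per word, creating levels as needed.
        node = node.setdefault(word, {})
    node["#"] = codes  # the final node holds the codes under the key '#'
    return tree

def lookup_actor(tree, words):
    """Return the codes stored for an exact word sequence, or None."""
    node = tree
    for word in words:
        if word not in node:
            return None
        node = node[word]
    return node.get("#")
```

Indexing on successive words means a lookup can stop as soon as a word is missing from the current level, without scanning the whole dictionary.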

PETRreader.read_agent_dictionary(agent_path)

Reads an agent dictionary. Agents are stored in a simpler version of the Actors dictionary: a list of phrases keyed on the first word of the phrase. The individual phrase lists begin with the code, the connector from the key, and then a series of 2-tuples containing the remaining words and connectors. A 2-tuple of the form (‘’, ‘ ‘) signals the end of the list.

Connector:

blank: words can occur between the previous word and the next word
_ (underscore): words must be consecutive: no intervening words

FORMATTING OF THE AGENT DICTIONARY

[With some additional coding, this can be relaxed, but anything following these rules should read correctly]

Basic structure is a series of records of the form

phrase_string {optional plural} [agent_code]

Material that is ignored:

1. Anything following ‘#’
2. Any line beginning with ‘#’ or ‘<!’
3. Any null line (that is, line consisting of only

A “phrase string” is a set of character strings separated by either blanks or underscores.

An “agent_code” is a character string without blanks that is either preceded (typically) or followed by ‘~’. If the ‘~’ precedes the code, the code is added after the actor code; if it follows the code, the code is added before the actor code (usually done for organizations, e.g. NGO~).

Plurals:

Regular plurals – those formed by adding ‘S’ to the root, adding ‘IES’ if the root ends in ‘Y’, and adding ‘ES’ if the root ends in ‘SS’ – are generated automatically.

If the plural has some other form, it follows the root inside {...}

If a plural should not be formed – that is, the root is only singular or only plural, or the singular and plural have the same form (e.g. “police”), use a null string inside {}.

If there is more than one form of the plural – “attorneys general” and “attorneys generals” are both in use – just make a second entry with one of the plural forms nulled (though in this instance – ain’t living English wonderful? – you could null the singular and use an automatic plural on the plural form). In a couple of test sentences, though, this phrase confused SCNLP.
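
The plural rules above can be sketched as follows. This is an illustration under stated assumptions, not the PETRreader code: the function names are hypothetical, "adding ‘IES’" is read as replacing the final ‘Y’, and the {...} field is passed in as None (absent), an empty string (null plural), or the explicit plural.

```python
def regular_plural(root):
    """Form the automatic plural of an upper-case root."""
    if root.endswith("Y"):
        return root[:-1] + "IES"   # MINISTRY -> MINISTRIES
    if root.endswith("SS"):
        return root + "ES"         # CONGRESS -> CONGRESSES
    return root + "S"              # TROOP -> TROOPS

def plural_forms(root, override=None):
    """Return the plural entries to generate for one dictionary record."""
    if override is None:
        return [regular_plural(root)]  # no {...} field: automatic plural
    if override == "":
        return []                      # {} : no plural is formed
    return [override]                  # {PLURAL}: use the given form
```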

Substitution Markers:

These are used to handle complex equivalents, notably

!PERSON! = MAN, MEN, WOMAN, WOMEN, PERSON
!MINIST! = MINISTER, MINISTERS, MINISTRY, MINISTRIES

and used in the form

CONGRESS!PERSON! [~LEG]
!MINIST!_OF_INTERNAL_AFFAIRS

The marker for the substitution set is of the form !...! and is followed by an = and a comma-delimited list; spaces are stripped from the elements of the list so these can be added for clarity. Every item in the list is substituted for the marker, with no additional plural formation, so the first construction would generate

CONGRESSMAN [~LEG]
CONGRESSMEN [~LEG]
CONGRESSWOMAN [~LEG]
CONGRESSWOMEN [~LEG]
CONGRESSPERSON [~LEG]
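
The substitution step can be sketched in a few lines. The helper names below are hypothetical and the sketch handles a single marker per entry; it only illustrates the expansion described above.

```python
def parse_marker(line):
    """Split '!NAME! = A, B, C' into ('!NAME!', ['A', 'B', 'C'])."""
    name, values = line.split("=", 1)
    # Spaces around list elements are stripped, as the format allows.
    return name.strip(), [value.strip() for value in values.split(",")]

def expand_entry(entry, markers):
    """Substitute every item of the first marker found in the entry."""
    for name, values in markers.items():
        if name in entry:
            return [entry.replace(name, value) for value in values]
    return [entry]  # no marker present: the entry is used unchanged

markers = dict([parse_marker("!PERSON! = MAN, MEN, WOMAN, WOMEN, PERSON")])
expanded = expand_entry("CONGRESS!PERSON! [~LEG]", markers)
```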

== Example ===

<!-- PETRARCH VALIDATION SUITE AGENTS DICTIONARY -->
<!-- VERSION: 0.1 -->
<!-- Last Update: 27 November 2013 -->

PARLIAMENTARY_OPPOSITION {} [~OPP] #jap 11 Oct 2002
AMBASSADOR [~GOV] # LRP 02 Jun 2004
COPTIC_CHRISTIAN [~CHRCPT] # BNL 10 Jan 2002
FOREIGN_MINISTER [~GOVFRM] # jap 4/14/01
PRESIDENT [~GOVPRS] # ns 6/26/01
AIR_FORCE {} [~MIL] # ab 06 Jul 2005
OFFICIAL_MEDIA {} [~GOVMED] # ab 16 Aug 2005
ATTORNEY_GENERAL {ATTORNEYS_GENERAL} [~GOVATG] # mj 05 Jan 2006
FOREIGN_MINISTRY [~GOV] # mj 17 Apr 2006
HUMAN_RIGHTS_ACTIVISTS [NGM~] # ns 6/14/01
HUMAN_RIGHTS_BODY [NGO~] # BNL 07 Dec 2001
TROOP [~MIL] # ab 22 Aug 2005

PETRreader.read_discard_list(discard_path)

Reads the file containing the discard list: these are simply lines containing strings. If the string, prefixed with ‘ ‘, is found in the <Text>...</Text> sentence, the sentence is not coded. Prefixing the string with a ‘+’ means the entire story is not coded when the string is found [see read_record() for details on story/sentence identification]. If the string ends with ‘_’, the matched string must also end with a blank or punctuation mark; otherwise it is treated as a stem. The matching is not case sensitive.

The file format allows # to be used as an in-line comment delimiter.

File is stored as a simple list and the interpretation of the strings is done in check_discards()

===== EXAMPLE =====

+5K RUN # ELH 06 Oct 2009
+ACADEMY AWARD # LRP 08 Mar 2004
AFL GRAND FINAL # MleH 06 Aug 2009
AFRICAN NATIONS CUP # ab 13 Jun 2005
AMATEUR BOXING TOURNAMENT # CTA 30 Jul 2009
AMELIA EARHART
ANDRE AGASSI # LRP 10 Mar 2004
ASIAN CUP # BNL 01 May 2003
ASIAN FOOTBALL # ATS 9/27/01
ASIAN MASTERS CUP # CTA 28 Jul 2009
+ASIAN WINTER GAMES # sls 14 Mar 2008
ATP HARDCOURT TOURNAMENT # mj 26 Apr 2006
ATTACK ON PEARL HARBOR # MleH 10 Aug 2009
AUSTRALIAN OPEN
AVATAR # CTA 14 Jul 2009
AZEROTH # CTA 14 Jul 2009 (World of Warcraft)
BADMINTON # MleH 28 Jul 2009
BALLCLUB # MleH 10 Aug 2009
BASEBALL
BASKETBALL
BATSMAN # MleH 14 Jul 2009
BATSMEN # MleH 12 Jul 2009
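
The matching rules for the discard list can be sketched as follows. This is a hedged illustration, not the code in check_discards(): the function names are hypothetical, entries are kept upper-cased for case-insensitive matching, and the second element of the returned pair (the matched entry) is an assumption about what "match" contains.

```python
import string

def load_discards(lines):
    """Strip '#' comments and blank lines; keep entries upper-cased."""
    entries = []
    for line in lines:
        text = line.split("#", 1)[0].strip()
        if text:
            entries.append(text.upper())
    return entries

def _matches(padded, phrase):
    """True if ' ' + phrase occurs in padded, honoring a trailing '_'."""
    whole_word = phrase.endswith("_")
    target = " " + phrase.rstrip("_")
    pos = padded.find(target)
    while pos != -1:
        end = pos + len(target)
        following = padded[end:end + 1]
        # Without '_' the phrase is a stem, so any continuation is accepted.
        if not whole_word or following.isspace() or following in string.punctuation:
            return True
        pos = padded.find(target, pos + 1)
    return False

def check_discards_sketch(sentence_text, discards):
    """Return [indic, match]: 0 no match, 1 simple match, 2 story match."""
    padded = " " + sentence_text.upper() + " "
    simple = None
    for entry in discards:
        if entry.startswith("+"):
            if _matches(padded, entry[1:]):
                return [2, entry[1:]]  # story matches take priority
        elif simple is None and _matches(padded, entry):
            simple = entry
    return [1, simple] if simple is not None else [0, ""]
```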

PETRreader.read_issue_list(issue_path)

“Issues” do simple string matching and return a comma-delimited list of codes. The standard format is simply

<string> [<code>]

For purposes of matching, a ‘ ‘ is added to the beginning and end of the string: at present there are no wildcards, though that is easily added.

The following expansions can be used (these apply to the string that follows up to the next blank)

n: Create the singular and plural of the noun
v: Create the regular verb forms (‘S’,’ED’,’ING’)
+: Create versions with ‘ ‘ and ‘-‘
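
A rough sketch of these expansions is given below. It is illustrative only: the function names are hypothetical, the verb suffixes are appended naively (so roots ending in ‘e’ are not handled), the base verb form is included by assumption, and the noun plural follows the simple S/IES/ES pattern.

```python
from itertools import product

def noun_forms(word):
    """Singular plus a simple regular plural."""
    if word.endswith("y"):
        return [word, word[:-1] + "ies"]
    if word.endswith("ss"):
        return [word, word + "es"]
    return [word, word + "s"]

def expand_token(token):
    """Expand one blank-delimited token of an issue phrase."""
    if token.startswith("n:"):
        return noun_forms(token[2:])
    if token.startswith("v:"):
        root = token[2:]
        return [root, root + "s", root + "ed", root + "ing"]  # naive suffixing
    if "+" in token:
        return [token.replace("+", " "), token.replace("+", "-")]
    return [token]

def expand_issue(phrase):
    """Return every phrase produced by expanding each token in turn."""
    options = [expand_token(token) for token in phrase.split()]
    return [" ".join(choice) for choice in product(*options)]
```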

The file format allows # to be used as an in-line comment delimiter.

File is stored in PETRglobals.IssueList as a list of tuples (string, index) where index refers to the location of the code in PETRglobals.IssueCodes. The coding is done in check_issues()

Issues are written to the event record as a comma-delimited list to a tab-delimited field, e.g.

20080801 ABC EDF 0001 POSTSECONDARY_EDUCATION 2, LITERACY 1 AFP0808-01-M008-02
20080801 ABC EDF 0004 AFP0808-01-M007-01
20080801 ABC EDF 0001 NUCLEAR_WEAPONS 1 AFP0808-01-M008-01

where XXXX NN, corresponds to the issue code and the number of matched phrases in the sentence that generated the event.

This feature is optional and triggered by a file name in the PETR_config.ini file at

issuefile_name = Phoenix.issues.140225.txt

<14.02.28> NOT YET FULLY IMPLEMENTED

The prefixes ‘~’ and ‘~~’ indicate exclusion phrases:

~ : if the string is found in the current sentence, do not code any of the issues in section – delimited by <ISSUE CATEGORY="...">...</ISSUE> – containing the string
~~ : if the string is found in the current story, do not code any of the issues in section

In the current code, the occurrence of an ignore phrase of either type cancels all coding of issues from the sentence

===== EXAMPLE =====

<ISSUE CATEGORY="ID_ATROCITY">
n:atrocity [ID_ATROCITY]
n:genocide [ID_ATROCITY]
ethnic cleansing [ID_ATROCITY]
ethnic v:purge [ID_ATROCITY]
ethnic n:purge [ID_ATROCITY]
war n:crime [ID_ATROCITY]
n:crime against humanity [ID_ATROCITY]
n:massacre [ID_ATROCITY]
v:massacre [ID_ATROCITY]
al+zarqawi network [NAMED_TERROR_GROUP]
~Saturday Night massacre
~St. Valentine’s Day massacre
~~Armenian genocide # not coding historical cases
</ISSUE>

PETRreader.read_pipeline_input(pipeline_list)

Reads input from the processing pipeline and MongoDB and creates the global holding dictionary. Please consult the documentation for more information on the format of the global holding dictionary. The function iteratively parses each file so is capable of processing large inputs without failing.

Parameters:

pipeline_list: List. :

List of dictionaries as stored in the MongoDB instance. These records are originally generated by the web scraper.

Returns:

holding: Dictionary. :

Global holding dictionary with StoryIDs as keys and various sentence- and story-level attributes as the inner dictionaries. Please refer to the documentation for greater information on the format of this dictionary.

PETRreader.read_verb_dictionary(verb_path)

Verb storage:

Storage sequence:

Upper noun phrases
Upper prepositional phrases
Lower noun phrases
Lower prepositional phrases
#

- symbol acts as extender, indicating the noun phrase is longer
, symbol acts as delimiter between several selected options

PETRreader.read_xml_input(filepaths, parsed=False)

Reads input in the PETRARCH XML-input format and creates the global holding dictionary. Please consult the documentation for more information on the format of the global holding dictionary. The function iteratively parses each file so is capable of processing large inputs without failing.

Parameters:

filepaths: List. :

List of XML files to process.

parsed: Boolean. :

Whether the input files contain parse trees as generated by StanfordNLP.

Returns:

holding: Dictionary. :

Global holding dictionary with StoryIDs as keys and various sentence- and story-level attributes as the inner dictionaries. Please refer to the documentation for greater information on the format of this dictionary.

PETRreader.show_verb_dictionary(filename=u'')

PETRwriter Module

PETRwriter.get_actor_text(meta_strg)

Extracts the source and target strings from the meta string.

PETRwriter.pipe_output(event_dict)

Format the coded event data for use in the processing pipeline.

Parameters:

event_dict: Dictionary. :

The main event-holding dictionary within PETRARCH.

Returns:

final_out: Dictionary. :

StoryIDs as the keys and a list of coded event tuples as the values, i.e., {StoryID: [(full_record), (full_record)]}. The full_record portion is structured as (story_date, source, target, code, joined_issues, ids, StorySource) with the joined_issues field being optional. The issues are joined in the format of ISSUE,COUNT;ISSUE,COUNT. The IDs are joined as ID;ID;ID.
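
The joined fields can be illustrated with a short sketch. The helper names are hypothetical and the inputs (a list of (issue, count) pairs and a list of ID strings) are assumptions about how the values are held before joining; only the output formats come from the description above.

```python
def join_issues(issues):
    """Join (issue, count) pairs as ISSUE,COUNT;ISSUE,COUNT."""
    return ";".join("%s,%d" % (code, count) for code, count in issues)

def join_ids(ids):
    """Join sentence IDs as ID;ID;ID."""
    return ";".join(ids)

def build_record(story_date, source, target, code, ids, story_source, issues=None):
    """Assemble one full_record tuple; joined_issues is included only if present."""
    record = [story_date, source, target, code]
    if issues:
        record.append(join_issues(issues))
    record.extend([join_ids(ids), story_source])
    return tuple(record)
```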

PETRwriter.write_events(event_dict, output_file)

Formats and writes the coded event data to a file in a standard event-data format.

Parameters:

event_dict: Dictionary. :

The main event-holding dictionary within PETRARCH.

output_file: String. :

Filepath to which events should be written.

PETRwriter.write_nullactors(event_dict, output_file)

Formats and writes the null actor data to a file as a set of lines in a JSON format.

Parameters:

event_dict: Dictionary. :

The main event-holding dictionary within PETRARCH.

output_file: String. :

Filepath to which events should be written.

PETRwriter.write_nullverbs(event_dict, output_file)

Formats and writes the null verb data to a file as a set of lines in a JSON format.

Parameters:

event_dict: Dictionary. :

The main event-holding dictionary within PETRARCH.

output_file: String. :

Filepath to which events should be written.

utilities Module

utilities.code_to_string(events)

Converts an event into a string, replacing the integer codes with strings representing their value in hex

utilities.combine_code(selfcode, to_add)

Combines two verb codes, part of the verb interaction framework

Parameters:

selfcode,to_add: ints :

Upper and lower verb codes, respectively

Returns:

combined value :

utilities.convert_code(code, forward=1)

Convert a verb code between CAMEO and the Petrarch internal coding ontology.

New coding scheme:

0               0          0                0
2 Appeal        1 Reduce   1 Meet           1 Leadership
3 Intend        2 Yield    2 Settle         2 Policy
4 Demand                   3 Mediate        3 Rights
5 Protest                  4 Aid            4 Regime
6 Threaten                 5 Expel          5 Econ
1 Say                      6 Pol. Change    6 Military
7 Disapprove               7 Mat. Coop      7 Humanitarian
8 Posture                  8 Dip. Coop      8 Judicial
9 Coerce                   9 Assault        9 Peacekeeping
A Investigate              A Fight          A Intelligence
B Consult                  B Mass violence  B Admin. Sanctions
                                            C Dissent
                                            D Release
                                            E Int’l Involvement
                                            F De-escalation

In the first column, higher numbers take priority, i.e. “Say + Intend” is just “Intend”, and “Intend + Consult” is just “Consult”.
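
The first-column priority rule can be shown in isolation. This sketch covers only that rule, with a hypothetical function name and a partial lookup table; it does not reproduce combine_code() or the handling of the other three columns.

```python
# Partial table of first-column values, taken from the scheme above.
FIRST_COLUMN = {0x1: "Say", 0x2: "Appeal", 0x3: "Intend", 0xB: "Consult"}

def combine_first_column(upper, lower):
    """Higher first-column values take priority when two verbs combine."""
    return max(upper, lower)

say, intend, consult = 0x1, 0x3, 0xB
say_plus_intend = FIRST_COLUMN[combine_first_column(say, intend)]
intend_plus_consult = FIRST_COLUMN[combine_first_column(intend, consult)]
```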

Parameters:

code: string or int, depending on forward :

Code to be converted

forward: boolean :

Direction of conversion, True = CAMEO -> PICO

Returns:

Forward mode: :

active, passive : int

The two parts of the code [XXX:XXX], converted to the new system. The first is an inherent active meaning, the second is an inherent passive meaning. Both are not always present; most codes have only the active.

utilities.extract_phrases(sent_dict, sent_id)

Text extraction for PETRglobals.WriteActorText and PETRglobals.WriteEventText

Parameters:

story_dict: Dictionary. :

Story-level dictionary as stored in the main event-holding dictionary within PETRARCH.

story_id: String. :

Unique StoryID in standard PETRARCH format.

Returns:

text_dict: Dictionary indexed by event 3-tuple. :

List of texts in the order [source_actor, target_actor, event]

utilities.init_logger(logger_filename)
utilities.nulllist = []

<16.06.27 pas> This might be better placed in PETRtree but I’m leaving it here so that it is clear it is a global. Someone who can better grok recursion than I might also be able to eliminate the need for it.

utilities.parse_to_text(parse)
utilities.story_filter(story_dict, story_id)

One-a-story filter for the events. There can be only one unique (DATE, SRC, TGT, EVENT) tuple per story.

Parameters:

story_dict: Dictionary. :

Story-level dictionary as stored in the main event-holding dictionary within PETRARCH.

story_id: String. :

Unique StoryID in standard PETRARCH format.

Returns:

filtered: Dictionary. :

Holder for filtered events with the format {(EVENT TUPLE): {‘issues’: [], ‘ids’: []}} where the ‘issues’ list is optional.
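
The filtering idea can be sketched as below. The input shape is an assumption made for the example (an iterable of (event_tuple, sentence_id, issues) triples), as is the function name; the real function works on the story-level dictionary instead.

```python
def story_filter_sketch(coded_events):
    """Collapse duplicate event tuples within one story."""
    filtered = {}
    for event, sentence_id, issues in coded_events:
        # One entry per unique (DATE, SRC, TGT, EVENT) tuple.
        entry = filtered.setdefault(event, {"issues": [], "ids": []})
        entry["ids"].append(sentence_id)
        for issue in issues:
            if issue not in entry["issues"]:
                entry["issues"].append(issue)
    return filtered
```

Keeping the sentence IDs on the surviving entry preserves the link back to every sentence that produced the same event.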

PETRtree Module

class PETRtree.NounPhrase(label, date, sentence)

Bases: PETRtree.Phrase

Class specific to noun phrases.

Methods: get_meaning() - specific version of the super’s method
check_date() - find the date-specific version of an actor

Methods

check_date(match)

Method for resolving date restrictions on actor codes.

Parameters:

match: list :

Dates and codes from the dictionary

Returns:

code: string :

The code corresponding to how the actor should be coded given the date
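
Date resolution can be sketched as follows. The assumptions are loud ones: the function name is hypothetical, match is taken to be a list of (code, [start, end]) pairs as in the actor dictionary example, an empty date list marks an unrestricted code, and all dates have already been normalized to YYYYMMDD strings so that string comparison orders them correctly.

```python
def check_date_sketch(match, story_date):
    """Pick the code whose date range contains story_date."""
    fallback = None
    for code, dates in match:
        if not dates:
            # Unrestricted code: remember the first one as a fallback.
            if fallback is None:
                fallback = code
        elif dates[0] <= story_date <= dates[1]:
            return code  # a date-restricted match wins immediately
    return fallback
```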

convert_existential()
get_meaning()
get_text()

Noun-specific get text method

return_meaning()
class PETRtree.Phrase(label, date, sentence)

This is a general class for all Phrase instances, which make up the nodes in the syntactic tree. The three subtypes are below.

Methods

get_head()

Method for finding the head of a phrase. The head of a phrase is the rightmost word-level constituent such that the path from root to head consists only of similarly-labeled phrases.

Parameters:

self: Phrase object that called the method :

Returns:

possibilities[-1]: tuple (string,NounPhrase) :

(The text of the head of the phrase, the NounPhrase object whose rightmost child is the head).

get_meaning()

Method for returning the meaning of the subtree rooted by this phrase. It is overridden by all subclasses, so this works primarily for S and S-Bar phrases.

Parameters:

self: Phrase object that called the method :

Returns:

events: list :

Combined meanings of the phrase’s children

get_parse_string()

recursive rendering of labelled phrase element and children as a string: when called from ROOT it returns the original input string

get_parse_text()

This is a fairly specific debugging function: to recover the original parse, use indented_parse_print(self, level=0) or get_parse_string(self)

get_text()
indented_parse_print(level=0)

recursive print of labeled phrase elements and children with line feeds and indentation

mix_codes(agents, actors)

Combine the actor codes and agent codes addressing duplicates and removing the general “~PPL” if there’s a better option.

Parameters:

agents, actors : Lists of their respective codes

Returns:

codes: list :

[Agent codes] x [Actor codes]
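
One way to picture the combination is the sketch below. It is not the mix_codes() implementation: the function name is hypothetical, codes are assumed to be plain strings, and the placement of the agent code relative to the actor code follows the ‘~’ convention of the agent dictionary.

```python
def mix_codes_sketch(agents, actors):
    """Combine every actor code with every agent code, without duplicates."""
    # Drop the general ~PPL agent when a more specific agent is available.
    specific = [agent for agent in agents if agent != "~PPL"]
    agents = specific or agents
    codes = []
    for actor in actors:
        combined = [actor] if not agents else []
        for agent in agents:
            if agent.startswith("~"):
                combined.append(actor + agent[1:])   # ~GOV  -> USAGOV
            elif agent.endswith("~"):
                combined.append(agent[:-1] + actor)  # NGO~  -> NGOUSA
        for code in combined:
            if code not in codes:
                codes.append(code)
    return codes
```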

print_to_stdout(indent)
resolve_codes(codes)

Method that divides a list of mixed codes into actor and agent codes

Parameters:

codes: list :

Mixed list of codes

Returns:

actorcodes: list :

List of actor codes

agentcodes: list :

List of agent codes

return_head()
class PETRtree.PrepPhrase(label, date, sentence)

Bases: PETRtree.Phrase

Methods

get_meaning()

Return the meaning of the non-preposition constituent, and store the preposition.

get_prep()
class PETRtree.Sentence(parse, text, date)

Holds the information of a sentence and its tree.

Methods

class PETRtree.VerbPhrase(label, date, sentence)

Bases: PETRtree.Phrase

Subclass specific to Verb Phrases

Methods

__init__: Initialization and Instantiation  
is_valid: Corrects a known Stanford error regarding miscoded noun phrases  
get_theme: Returns the coded target of the VP  
get_meaning: Returns event coding described by the verb phrase  
get_lower: Finds meanings of children  
get_upper: Finds grammatical subject  
get_code: Finds base verb code and calls match_pattern  
match_pattern: Matches the tree to a pattern in the Verb Dictionary  
get_S: Finds the closest S-level phrase above the verb  
match_transform: Matches an event code against transformation patterns in the dictionary  
check_passive()
Check if the verb is passive under these conditions:
  1. Verb is -ed form, which is notated by Stanford as VBD or VBN
  2. Verb has a form of “be” as its next highest verb
Parameters:

self: VerbPhrase object calling the method :

Returns:

self.passive: boolean :

Whether or not it is passive
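
The two conditions can be expressed as a small predicate. This is a sketch with hypothetical names: it takes the verb's part-of-speech tag and the text of the next highest verb as plain arguments, whereas the method itself reads them from the tree.

```python
# Forms of "be" checked against the governing verb (upper-cased).
BE_FORMS = {"BE", "AM", "IS", "ARE", "WAS", "WERE", "BEEN", "BEING"}

def is_passive(verb_tag, governing_verb):
    """True if the verb is VBD/VBN and its next highest verb is a form of 'be'."""
    if verb_tag not in ("VBD", "VBN"):
        return False
    return governing_verb is not None and governing_verb.upper() in BE_FORMS
```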

get_S()

Navigate up the tree following a VP path to find the closest S-level phrase. There is the extra condition that an S-level phrase that is a “TO”-phrase without a second subject specified is skipped, so that “A wants to help B” will navigate all the way up to “A wants” rather than stopping at “to”.

Parameters:

self: VerbPhrase object that called the method :

Returns:

level: VerbPhrase object :

Lowest non-TO S-level phrase object above the verb

get_code()

Match the codes from the Verb Dictionary.

Step 1. Check for compound verb matches

Step 2. Check for pattern matches via match_pattern() method

Parameters:

self: VerbPhrase object that called the method :

Returns:

code: int :

Code described by this verb, best read in hex

get_lower()

Find the meaning of the children of the VP, and whether or not there is a “not” in the VP.

If the VP has VP children, look only at these.

Otherwise, this function is essentially identical to the NounPhrase.get_meaning() method, except that it doesn’t look at word-level children, because it shouldn’t have any.

Parameters:

self: VerbPhrase object that called the method :

Returns:

self.lower: list :

Actor codes or Event codes, depending on situation

negated: boolean :

Whether a “not” is present

get_meaning()

This determines the event coding of the subtree rooted in this verb phrase.

Four methods are key in this process: get_upper(), get_lower(), get_code() and match_transform().

First, get_meaning() gets the verb code from get_code()

Then, it checks passivity. If the verb is passive, then it looks within verb phrases headed by [by, from, in] for the source, and for an explicit target in verb phrases headed by [at,against,into,towards]. If no target is found, this space is filled with ‘passive’, a flag used later to assign a target if it is in the grammatical subject position.

If the verb is not passive, then the process goes:

1) Call get_upper() and get_lower() to check for a grammatical subject and find the coding of the subtree and children, respectively.

2) If get_lower() returned a list of events, combine those events with the upper and code, add to event list.

3) Otherwise, combine upper, lower, and code and add to event list.

4) Check to see if there are S-level children, if so, combine with upper and code, add to list.

5) Call match_transform() on all events in the list.

Parameters:

self: VerbPhrase object that called the method :

Returns:

events: list :

List of events coded by the subtree rooted in this phrase.

get_theme()

This is used by the NounPhrase.get_meaning() method to determine relevant information in the VerbPhrase.

get_upper()

Finds the meaning of the specifier (NP sibling) of the VP.

Parameters:

self: VerbPhrase object that called the method :

Returns:

self.upper: List :

Actor codes of spec-VP

is_valid()

This method is largely to overcome frequently made Stanford errors, where phrases like “exiled dissidents” were marked as verb phrases, and treating them as such would yield weird parses.

Once such a phrase is identified because of its weird distribution, it is converted to a NounPhrase object

match_pattern()

Match the tree against patterns specified in the dictionary. For a more illustrated explanation of how this process works, see the Petrarch2.pdf file in the documentation.

Parameters:

self: VerbPhrase object that called the method :

Returns:

False if no match, dict of match if present. :

match_transform(e)

Check to see if the event e follows one of the verb transformation patterns specified at the bottom of the Verb Dictionary file.

If the transformation is present, adjust the event accordingly. If no transformation is present, check if the event is of the form:

a ( b . Q ) P , where Q is not a top-level verb.

and then convert this to ( a b P+Q )

Otherwise, return the event as-is.

Parameters:

e: tuple :

Event to be transformed

Returns:

t: list of tuples :

List of modified events, since multiple events can come from one single event

return_S()
return_code()
return_lower()
return_meaning()
return_passive()
return_upper()