Name: Literature Filtering
Author: AIPOCH

When to Use

You need to quickly narrow a large bibliography by publication year range (e.g., 2015–2024).
You must restrict results to a target journal set (e.g., a whitelist/blacklist of journals).
You are running preliminary screening before full-text review and need traceable inclusion/exclusion decisions.
You are conducting systematic review screening and must record consistent reasons for exclusion.
You need standardized outputs (lists + logs) for collaboration, auditing, or downstream analysis.

Key Features

Rule-based filtering by year, journal, and literature type/criteria.
Journal name normalization to match abbreviations and full names consistently.
Structured recording of exclusion reasons for transparency and reproducibility.
Support for borderline/controversial item review to improve consistency.
Standardized outputs: inclusion list, exclusion list, and screening statistics/summary.

Dependencies

None beyond the Python standard library (documentation-driven workflow). The example script needs Python 3.7+ for dataclasses.
Optional template file:
- assets/screening_log_template.csv

Example Usage

The following example is a complete, runnable Python script that:

1) normalizes journal names, 2) filters by year and journal whitelist, 3) applies simple inclusion/exclusion rules, and 4) outputs inclusion/exclusion CSV files plus a screening log.

#!/usr/bin/env python3
import csv
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

# ----------------------------
# Configuration (edit as needed)
# ----------------------------
YEAR_MIN = 2018
YEAR_MAX = 2024

# Journal whitelist after normalization
JOURNAL_WHITELIST = {
    "journal of finance",
    "journal of financial economics",
    "review of financial studies",
}

# Abbreviation/full-name mapping (extend as needed)
JOURNAL_ALIASES = {
    "j. finan.": "journal of finance",
    "j finan": "journal of finance",
    "jfe": "journal of financial economics",
    "rev. financ. stud.": "review of financial studies",
    "rfs": "review of financial studies",
}

# Simple keyword-based screening rules (example)
INCLUDE_KEYWORDS = {"asset pricing", "corporate finance", "risk premium"}
EXCLUDE_KEYWORDS = {"editorial", "book review", "erratum"}

# ----------------------------
# Data model
# ----------------------------
@dataclass
class Record:
    id: str
    title: str
    year: int
    journal: str
    abstract: str

# ----------------------------
# Helpers
# ----------------------------
def normalize_journal(name: str, aliases: Dict[str, str]) -> str:
    """
    Normalize journal names:
    - lowercase
    - strip punctuation
    - collapse whitespace
    - map abbreviations to canonical full names
    """
    if not name:
        return ""
    raw = name.strip().lower()
    raw = re.sub(r"[^\w\s\.]", " ", raw)  # keep dots for alias keys like "j. finan."
    raw = re.sub(r"\s+", " ", raw).strip()

    # Try alias mapping on the dot-preserved version
    if raw in aliases:
        return aliases[raw]

    # Also try a dot-stripped variant; re-collapse spaces left behind by removed dots
    nodot = re.sub(r"\s+", " ", raw.replace(".", "")).strip()
    if nodot in aliases:
        return aliases[nodot]

    # No alias matched: the dot-stripped form is the canonical name
    return nodot

def contains_any(text: str, keywords: set) -> bool:
    t = (text or "").lower()
    return any(k in t for k in keywords)

def screen_record(r: Record) -> Tuple[bool, str]:
    """
    Returns (included, reason).
    Reasons are designed to be human-auditable.
    """
    if r.year < YEAR_MIN or r.year > YEAR_MAX:
        return False, f"Excluded: year out of range ({r.year})"

    norm_journal = normalize_journal(r.journal, JOURNAL_ALIASES)
    if norm_journal not in JOURNAL_WHITELIST:
        return False, f"Excluded: journal not in whitelist ({norm_journal})"

    text = f"{r.title}\n{r.abstract}"
    if contains_any(text, EXCLUDE_KEYWORDS):
        return False, "Excluded: matches exclusion keywords"

    if not contains_any(text, INCLUDE_KEYWORDS):
        return False, "Excluded: does not match inclusion keywords"

    return True, "Included: meets all criteria"

# ----------------------------
# I/O
# ----------------------------
def read_input_csv(path: str) -> List[Record]:
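    """
    Expected columns: id,title,year,journal,abstract
    Missing or non-numeric years are read as 0, so those records fail the
    year filter and appear in the log instead of crashing the run.
    """
    out = []
    with open(path, "r", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            # DictReader yields None for fields missing from short rows
            year_text = (row.get("year") or "").strip()
            out.append(
                Record(
                    id=(row.get("id") or "").strip(),
                    title=(row.get("title") or "").strip(),
                    year=int(year_text) if year_text.isdecimal() else 0,
                    journal=(row.get("journal") or "").strip(),
                    abstract=(row.get("abstract") or "").strip(),
                )
            )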
    return out

def write_csv(path: str, rows: List[Dict[str, str]], fieldnames: List[str]) -> None:
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        w.writerows(rows)

def main():
    input_path = "input_literature.csv"
    records = read_input_csv(input_path)

    included, excluded, log = [], [], []
    for r in records:
        norm_journal = normalize_journal(r.journal, JOURNAL_ALIASES)
        ok, reason = screen_record(r)

        log.append({
            "id": r.id,
            "title": r.title,
            "year": str(r.year),
            "journal_raw": r.journal,
            "journal_normalized": norm_journal,
            "decision": "include" if ok else "exclude",
            "reason": reason,
        })

        base = {
            "id": r.id,
            "title": r.title,
            "year": str(r.year),
            "journal": norm_journal,
        }
        (included if ok else excluded).append(base)

    write_csv("included.csv", included, ["id", "title", "year", "journal"])
    write_csv("excluded.csv", excluded, ["id", "title", "year", "journal"])
    write_csv(
        "screening_log.csv",
        log,
        ["id", "title", "year", "journal_raw", "journal_normalized", "decision", "reason"],
    )

    # Simple screening statistics
    stats = {
        "total": len(records),
        "included": len(included),
        "excluded": len(excluded),
    }
    print("Screening complete:", stats)
    print("Outputs: included.csv, excluded.csv, screening_log.csv")

if __name__ == "__main__":
    main()

Minimal input file example (input_literature.csv):

id,title,year,journal,abstract
1,Asset Pricing with Risk Premiums,2020,J. Finan.,We study asset pricing and the risk premium...
2,An Editorial Note,2021,Journal of Finance,This editorial summarizes...
3,Corporate Finance Evidence,2017,JFE,Empirical corporate finance results...

Implementation Details

1. Rule Setting

Year rules: define an inclusive range [YEAR_MIN, YEAR_MAX].
Journal rules:
- Use a whitelist (or blacklist) of canonical journal names.
- Apply normalization before matching to avoid false mismatches.
Screening criteria:
- Define explicit inclusion/exclusion criteria (e.g., topic, study type, population, method).
- Ensure each exclusion has a single primary reason (or a controlled multi-reason scheme).
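
The rules above can be gathered into one object so that every reviewer applies the same version. The following is a minimal sketch, not part of the example script: the RuleSet class, its field names, and the journal_mode switch are hypothetical names chosen for illustration, and journal names are assumed to be normalized before they reach journal_ok.

```python
from dataclasses import dataclass, field
from typing import FrozenSet


@dataclass(frozen=True)
class RuleSet:
    """Hypothetical container for one version of the screening rules."""
    year_min: int
    year_max: int
    journals: FrozenSet[str] = field(default_factory=frozenset)
    journal_mode: str = "whitelist"  # assumed values: "whitelist" or "blacklist"

    def year_ok(self, year: int) -> bool:
        # Inclusive range [year_min, year_max]
        return self.year_min <= year <= self.year_max

    def journal_ok(self, normalized_journal: str) -> bool:
        # Expects a journal name that has already been normalized
        listed = normalized_journal in self.journals
        return listed if self.journal_mode == "whitelist" else not listed


rules = RuleSet(
    year_min=2018,
    year_max=2024,
    journals=frozenset({"journal of finance", "review of financial studies"}),
)
print(rules.year_ok(2017), rules.journal_ok("journal of finance"))
```

Freezing the dataclass is a design choice: a rule set that cannot be mutated mid-run is easier to audit, and a changed rule becomes a new, separately versioned object.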

2. Journal Name Normalization

Recommended normalization steps (in order):

1. Convert to lowercase.
2. Remove/standardize punctuation and collapse whitespace.
3. Apply abbreviation/full-name mapping (e.g., J. Finan. → Journal of Finance).
4. Output a canonical form used for matching and reporting.

Key parameters:

JOURNAL_ALIASES: dictionary for abbreviation/full-name mapping.
Normalization policy choices:
- Case sensitivity (typically disabled by lowercasing).
- Punctuation handling (strip most punctuation; optionally preserve dots for alias keys).
- Whitespace collapsing.

3. Execution of Screening

Apply filters in a stable order to keep decisions consistent and auditable:
1. Year range
2. Journal match (after normalization)
3. Inclusion/exclusion criteria
Record a decision and reason for every record in a screening log.
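
One way to make the fixed order explicit is to hold the filters in a list and stop at the first failure, which then becomes the single primary exclusion reason. This is a sketch under assumptions: the rule functions, the dictionary-based record, and the hard-coded thresholds are illustrative only, and the journal field is assumed to be normalized already.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each rule returns None on pass, or a short exclusion reason on failure.
Rule = Callable[[Dict[str, str]], Optional[str]]


def year_rule(rec: Dict[str, str]) -> Optional[str]:
    year = int(rec["year"])
    return None if 2018 <= year <= 2024 else f"year out of range ({year})"


def journal_rule(rec: Dict[str, str]) -> Optional[str]:
    allowed = {"journal of finance"}  # illustrative whitelist
    if rec["journal"] in allowed:
        return None
    return f"journal not in whitelist ({rec['journal']})"


def criteria_rule(rec: Dict[str, str]) -> Optional[str]:
    # Illustrative exclusion keyword check on the title only
    return "matches exclusion keywords" if "editorial" in rec["title"].lower() else None


# The list order is the screening order: year, journal, then criteria.
ORDERED_RULES: List[Rule] = [year_rule, journal_rule, criteria_rule]


def first_failure(rec: Dict[str, str]) -> Tuple[str, str]:
    """Return (decision, reason); the first failing rule supplies the reason."""
    for rule in ORDERED_RULES:
        reason = rule(rec)
        if reason is not None:
            return "exclude", reason
    return "include", "meets all criteria"


rec = {"year": "2017", "journal": "journal of finance", "title": "An Editorial Note"}
print(first_failure(rec))
```

The sample record fails both the year rule and the criteria rule, but only the year reason is logged because the year rule runs first.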

4. Review and Consistency

Flag borderline items (e.g., unclear abstracts, ambiguous journal names) for manual review.
Keep a shared, versioned rule set (year range, journal list, alias map, criteria) to ensure consistent application across reviewers.
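
Borderline items can be flagged automatically before a human looks at them. The sketch below is one possible heuristic, not a prescribed method: the review_flags function, the MIN_ABSTRACT_CHARS threshold, and the 0.85 similarity cutoff are all assumptions to tune per project. It uses difflib.get_close_matches from the standard library to catch journal names that nearly match a known canonical name.

```python
import difflib
from typing import List, Set

MIN_ABSTRACT_CHARS = 50  # assumed threshold for an "unclear" abstract


def review_flags(abstract: str, normalized_journal: str, known_journals: Set[str]) -> List[str]:
    """Return reasons a record should get manual review (empty list if none)."""
    flags = []
    if len(abstract.strip()) < MIN_ABSTRACT_CHARS:
        flags.append("abstract missing or too short")
    if normalized_journal not in known_journals:
        # A near miss suggests a typo or an unmapped alias rather than a different journal
        near = difflib.get_close_matches(
            normalized_journal, sorted(known_journals), n=1, cutoff=0.85
        )
        if near:
            flags.append(f"journal close to known name: {near[0]}")
    return flags


known = {"journal of finance", "review of financial studies"}
print(review_flags("", "journal of finances", known))
```

Flagged records still receive a rule-based decision; the flags only mark which decisions a reviewer should confirm.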

5. Output Organization

Produce at minimum:

included.csv: records that pass all rules.
excluded.csv: records that fail at least one rule.
screening_log.csv: full trace for every record, with raw and normalized journal, decision, and reason.
Optional: screening statistics and a reason summary (counts by reason).
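
The optional reason summary can be derived from the screening log. The following is a minimal sketch that assumes log rows shaped like those in screening_log.csv (a "reason" column) and groups reasons by stripping a trailing parenthetical detail such as the year; summarize_reasons is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def summarize_reasons(log_rows: Iterable[Dict[str, str]]) -> Counter:
    """Count log rows per reason, ignoring a trailing '(detail)' suffix."""
    counts: Counter = Counter()
    for row in log_rows:
        reason = re.sub(r"\s*\([^)]*\)\s*$", "", row["reason"])
        counts[reason] += 1
    return counts


# In practice these rows would come from csv.DictReader over screening_log.csv.
sample_log = [
    {"id": "1", "reason": "Included: meets all criteria"},
    {"id": "2", "reason": "Excluded: matches exclusion keywords"},
    {"id": "3", "reason": "Excluded: year out of range (2017)"},
    {"id": "4", "reason": "Excluded: year out of range (2016)"},
]
summary = summarize_reasons(sample_log)
for reason, n in summary.most_common():
    print(f"{n}\t{reason}")
```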

Reference formats and checkpoints can be aligned with references/guide.md if available.