Citation Network

AIPOCH

Build and visualize a citation network from a source/target CSV to identify key papers, communities, and emerging hotspots; use when you have citation pairs and need fast literature review or trend analysis.

FILES

citation-network/
  skill.md
  scripts/
    build_citation_network.py
    export_gexf_html.py
    init_run.py
  references/
    data-cleaning-checklist.md
    network-metrics-notes.md
    README.md

92/100 Total Score

  • Core Capability: 87 / 100
  • Functional Suitability: 11 / 12
  • Reliability: 10 / 12
  • Performance & Context: 8 / 8
  • Agent Usability: 14 / 16
  • Human Usability: 8 / 8
  • Security: 9 / 12
  • Maintainability: 10 / 12
  • Agent-Specific: 17 / 20
  • Medical Task: 20 / 20 (Passed)

Scenario scores (each 4/4):

  • 100: You have a citation relationship table (who cites whom) and want to quickly turn it into a directed network for analysis
  • 97: You are conducting a literature review and need to identify influential papers (high in-degree / centrality) and core clusters
  • 95: Builds a directed citation graph from a minimal CSV containing source and target
  • 94: De-duplicates nodes by identifier (DOI recommended; otherwise unique titles)
  • 94: End-to-end case for "Builds a directed citation graph from a minimal CSV containing source and target"

SKILL.md

When to Use

  • You have a citation relationship table (who cites whom) and want to quickly turn it into a directed network for analysis.
  • You are conducting a literature review and need to identify influential papers (high in-degree / centrality) and core clusters.
  • You want to detect community structures (research subfields) and compare them across time or datasets.
  • You need an interactive, shareable visualization (HTML) or a Gephi-importable graph file (GEXF).
  • You are positioning a new project and want evidence of research hotspots and bridging papers between communities.

Key Features

  • Builds a directed citation graph from a minimal CSV containing source and target.
  • De-duplicates nodes by identifier (DOI recommended; otherwise unique titles).
  • Exports:
    • citation_network.gexf for Gephi and other graph tools
    • network_metrics.json for basic network statistics
    • citation_network.html for interactive browser viewing (auto-generated by the build script)
  • Run-directory workflow to keep each execution reproducible and isolated under outputs/runs/<timestamp>/.
  • Optional input encoding control to avoid garbled characters (e.g., UTF-8 / UTF-8-SIG).

Dependencies

  • Python 3.10+
  • pandas >= 2.0
  • networkx >= 3.0
  • (Optional, for HTML visualization) pyvis >= 0.3

Example Usage

1) Initialize a run directory

python scripts/init_run.py

This creates a new run folder:

outputs/runs/<timestamp>/
  config.json
  data/
  outputs/
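
Conceptually, the initializer just creates a timestamped directory tree and writes a default config. A minimal sketch of what scripts/init_run.py might do; the actual script and its config field names (input_file, source_column, target_column, input_encoding here) are assumptions, so check the generated config.json for the real schema:

```python
import json
from datetime import datetime
from pathlib import Path

# Timestamped run directory keeps each execution isolated and reproducible.
timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
run_dir = Path("outputs/runs") / timestamp
(run_dir / "data").mkdir(parents=True, exist_ok=True)
(run_dir / "outputs").mkdir(exist_ok=True)

# Assumed config fields -- the shipped script may use different names.
config = {
    "input_file": "citations.csv",
    "source_column": "source",
    "target_column": "target",
    "input_encoding": "utf-8",
}
(run_dir / "config.json").write_text(json.dumps(config, indent=2))
```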

2) Prepare the citation CSV (minimal)

Create citations.csv and place it into:

outputs/runs/<timestamp>/data/citations.csv

Minimal CSV format:

source,target
Paper A,Paper B
Paper A,Paper C

Recommended DOI-based identifiers:

source,target
10.1234/abcd.1,10.1234/abcd.2
10.1234/abcd.1,10.1234/abcd.3
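
The core build step likely amounts to loading these two columns and handing them to networkx. A minimal sketch under that assumption (it writes the example CSV inline so it is self-contained; the real build_citation_network.py may differ):

```python
from pathlib import Path

import networkx as nx
import pandas as pd

# Write the minimal example CSV from above, then read it back.
Path("citations.csv").write_text("source,target\nPaper A,Paper B\nPaper A,Paper C\n")

# utf-8-sig also swallows a leading BOM, a common cause of garbled headers.
df = pd.read_csv("citations.csv", encoding="utf-8-sig")
df = df.dropna(subset=["source", "target"]).drop_duplicates()

# Identical identifiers collapse to a single node, so consistent DOIs
# de-duplicate naturally.
G = nx.from_pandas_edgelist(df, source="source", target="target",
                            create_using=nx.DiGraph)
```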

3) Confirm configuration

Open:

outputs/runs/<timestamp>/config.json

Ensure the configured input filename and column names match your CSV (at minimum source and target). If you see garbled characters, set an explicit encoding (e.g., utf-8 or utf-8-sig) via an input_encoding field if supported by the config.
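
A header mismatch is cheap to catch before building. A sketch of such a pre-flight check; the config keys (source_column, target_column, input_encoding) are illustrative assumptions, and in practice the config would be loaded from the run's config.json rather than defined inline:

```python
import csv
from pathlib import Path

# Illustrative config; load the run's config.json in real use.
config = {
    "input_file": "citations.csv",
    "source_column": "source",
    "target_column": "target",
    "input_encoding": "utf-8-sig",
}

Path(config["input_file"]).write_text("source,target\nPaper A,Paper B\n")

# Read only the header row with the configured encoding.
with open(config["input_file"], newline="",
          encoding=config.get("input_encoding", "utf-8")) as f:
    header = next(csv.reader(f))

missing = {config["source_column"], config["target_column"]} - set(header)
if missing:
    raise SystemExit(f"CSV is missing configured columns: {missing}")
```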

4) Build the citation network

python scripts/build_citation_network.py

The build script will also generate the HTML automatically (you do not need to run scripts/export_gexf_html.py manually).

5) Inspect outputs

Expected outputs under the same run directory:

  • citation_network.gexf (import into Gephi)
  • network_metrics.json (node/edge counts, density, etc.)
  • citation_network.html (open in a browser)
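
The metrics file presumably holds statistics along these lines. A sketch of how they could be computed and the GEXF written with networkx; the exact JSON keys are assumptions, not the file's actual schema:

```python
import json

import networkx as nx

G = nx.DiGraph([("Paper A", "Paper B"), ("Paper A", "Paper C")])

# Assumed metric keys; the real network_metrics.json may name them differently.
metrics = {
    "nodes": G.number_of_nodes(),
    "edges": G.number_of_edges(),
    "density": nx.density(G),
}
with open("network_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

nx.write_gexf(G, "citation_network.gexf")  # importable into Gephi
```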

Implementation Details

Data Model

  • Nodes: papers, identified by the value in source/target (DOI preferred; otherwise a unique, consistent title string).
  • Edges: directed citations source -> target.

Input Requirements and Constraints

  • The network builder reads only the source and target columns.
  • Additional columns (e.g., author/year/venue) are ignored by the current scripts.
  • If you need metadata, maintain a separate table for downstream joining/annotation (not consumed by the builder), for example:

id,title,authors,year,doi
10.1234/abcd.1,Paper A,"Zhang, Wei; Li, Ming",2021,10.1234/abcd.1
10.1234/abcd.2,Paper B,"Wang, Fang",2019,10.1234/abcd.2
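
Such a table can be joined onto the graph's node identifiers downstream. A sketch with pandas; this is an illustration of the join, not something the builder does:

```python
import pandas as pd

# Node identifiers as they appear in the graph (DOIs here).
nodes = pd.DataFrame({"id": ["10.1234/abcd.1", "10.1234/abcd.2", "10.1234/abcd.3"]})

# Separate metadata table, maintained outside the builder.
meta = pd.DataFrame({
    "id": ["10.1234/abcd.1", "10.1234/abcd.2"],
    "title": ["Paper A", "Paper B"],
    "year": [2021, 2019],
})

# Left join keeps every node; papers without metadata get NaN.
annotated = nodes.merge(meta, on="id", how="left")
```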

Run Directory Standard

  • Always run python scripts/init_run.py before an execution to create a new run directory.
  • All inputs, configs, and outputs must remain inside outputs/runs/<timestamp>/.
  • By default, scripts operate on the latest run directory under outputs/runs/.
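
Because the timestamped directory names sort chronologically, "latest run" can be resolved with a plain sort. A hypothetical helper, not part of the shipped scripts (a separate demo directory, runs_demo, is used here to keep the example self-contained):

```python
from pathlib import Path

def latest_run(base="outputs/runs"):
    """Return the newest run directory, assuming timestamped names sort in order."""
    runs = sorted(p for p in Path(base).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError("no run directories; run scripts/init_run.py first")
    return runs[-1]

# Example: two runs; the later timestamp wins.
for ts in ("20240101-120000", "20240102-093000"):
    (Path("runs_demo") / ts).mkdir(parents=True, exist_ok=True)
```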

Metrics and Analysis (Conceptual)

  • Basic network statistics are exported to network_metrics.json (e.g., node/edge counts, density).
  • Typical downstream analyses include:
    • centrality (degree, betweenness)
    • community detection (e.g., Louvain), if enabled/implemented in the pipeline
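
Those downstream analyses can be sketched directly with networkx (the Louvain implementation is built in from recent networkx releases; it operates on undirected graphs, so the directed citation graph is converted first). The toy graph and node names are illustrative:

```python
import networkx as nx

# Toy citation graph: C is cited by both A and B, and itself cites D.
G = nx.DiGraph([("A", "C"), ("B", "C"), ("C", "D")])

in_degree = dict(G.in_degree())             # citation count: a simple influence proxy
betweenness = nx.betweenness_centrality(G)  # highlights bridging papers

# Louvain community detection on an undirected view of the graph.
communities = nx.community.louvain_communities(G.to_undirected(), seed=42)
```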

Common Failure Modes

  • Garbled characters: ensure CSV is UTF-8/UTF-8-SIG; set input_encoding in config.json if available.
  • Duplicate nodes: identical identifiers are treated as the same node; prefer DOIs or enforce unique titles.
  • Empty or missing output: verify the CSV header names match the configured source/target columns.

References

  • Data cleaning checklist: references/data-cleaning-checklist.md
  • Network metrics notes: references/network-metrics-notes.md
  • Additional documentation: references/README.md