# How to Use

This guide explains how to run the CLI and describes each section of the YAML configuration file.
## Run Fast-MIA

- Create a configuration file (refer to `config/sample.yaml`).
- Run the following command:

```bash
uv run --with 'vllm==0.15.1' python main.py --config config/your_own_configuration.yaml
```

Note: When using T4 GPUs (e.g., Google Colab), set the environment variable to avoid attention backend issues:

```bash
VLLM_ATTENTION_BACKEND=XFORMERS uv run --with 'vllm==0.15.1' python main.py --config config/sample.yaml
```
## CLI Arguments

| Flag | Required | Default | Purpose |
|---|---|---|---|
| `--config` | ✅ | – | Path to the YAML configuration file. |
| `--seed` | ❌ | `42` | Global seed passed to `random`, `torch`, and `numpy` to make evaluations reproducible. |
| `--max-cache-size` | ❌ | `1000` | Maximum number of vLLM generations cached across methods; the default is sufficient for most runs. |
| `--detailed-report` | ❌ | Off | Generate a detailed report with metadata, per-sample scores, and visualizations. |
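For example, a fully specified invocation combining these flags might look like this (the config path is a placeholder):

```bash
uv run --with 'vllm==0.15.1' python main.py \
  --config config/your_own_configuration.yaml \
  --seed 42 \
  --max-cache-size 1000 \
  --detailed-report
```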
## Configuration Files

All behavior is driven by a YAML file. Each top-level key toggles a different subsystem:

| Key | Purpose |
|---|---|
| `model` | Required. Parameters are forwarded directly to `vllm.LLM` (`model`, `dtype`, etc.). |
| `sampling_parameters` | Optional. Parameters are forwarded directly to `vllm.SamplingParams`. When omitted, `prompt_logprobs` defaults to `0`, `max_tokens` to `1`, `temperature` to `0.0`, and `top_p` to `1.0`. |
| `lora` | Optional. Parameters are forwarded directly to `vllm.lora.request.LoRARequest`. Omit when no adapter is needed. |
| `data` | Required. Specifies the dataset source, file format, column names, etc. |
| `methods` | Required. Ordered list of evaluation methods. Each entry declares a `type` and method-specific `params`. |
| `output_dir` | Optional. Directory where evaluation CSVs are stored (`./results` by default). |
### Example Configuration

Please refer to `config/sample.yaml` for a complete example configuration file.
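In addition to that file, here is a minimal sketch showing how the top-level keys fit together (the model ID and data path are placeholders, not recommendations):

```yaml
model:
  model_id: "EleutherAI/pythia-160m"  # placeholder model
data:
  data_path: "data/my_dataset.csv"    # placeholder local file
  format: "csv"
methods:
  - type: "loss"
  - type: "zlib"
output_dir: "./results"
```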
### model Block

| Field | Required | Notes |
|---|---|---|
| `model_id` | ✅ | The name or path of a Hugging Face Transformers model that `vllm.LLM` can load (forwarded as `model`). |
| Other keys | ❌ | Forwarded directly to `vllm.LLM`. Select parameters that fit the model onto your hardware, following the vLLM `LLM` API reference. |
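A sketch of a typical `model` block; the model ID is a placeholder, and the extra keys shown (`dtype`, `gpu_memory_utilization`) are examples of `vllm.LLM` constructor arguments you might pass through:

```yaml
model:
  model_id: "EleutherAI/pythia-160m"  # placeholder; forwarded as `model`
  dtype: "bfloat16"                   # forwarded directly to vllm.LLM
  gpu_memory_utilization: 0.9         # forwarded directly to vllm.LLM
```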
### sampling_parameters Block

Values are passed to `vllm.SamplingParams`. Select parameters following the vLLM `SamplingParams` API reference. Whenever the config omits these keys, Fast-MIA automatically enforces `prompt_logprobs: 0`, `max_tokens: 1`, `temperature: 0.0`, and `top_p: 1.0` to ensure deterministic, efficient scoring; you can override any of these defaults by setting them explicitly inside this block.

Recommended defaults for deterministic scoring:

| Field | Purpose |
|---|---|
| `max_tokens` | Use `1` to request only the immediate next token; Fast-MIA defaults to this. |
| `prompt_logprobs` | Use `0` to get prompt-text log-probabilities; Fast-MIA defaults to this. |
| `temperature` | Set to `0.0` for deterministic runs; Fast-MIA defaults to this. |
| `top_p` | Leave at `1.0` so determinism is governed solely by `temperature`; Fast-MIA defaults to this. |
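Spelling out the enforced defaults explicitly is equivalent to omitting the block entirely; override a key only if you need different behavior:

```yaml
sampling_parameters:
  prompt_logprobs: 0   # prompt-text log-probabilities
  max_tokens: 1        # only the immediate next token
  temperature: 0.0     # deterministic decoding
  top_p: 1.0           # determinism governed solely by temperature
```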
### lora Block (Optional)

Values are passed to `vllm.lora.request.LoRARequest`. Select parameters following the vLLM `LoRARequest` API reference.

| Field | Purpose |
|---|---|
| `lora_name` | Human-readable adapter name (used in logs). |
| `lora_int_id` | Integer identifier that vLLM uses to cache the adapter. Use different IDs for simultaneously loaded adapters. |
| `lora_path` | Filesystem path to the adapter weights. |
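A sketch of a `lora` block; the adapter name and path are placeholders:

```yaml
lora:
  lora_name: "my-adapter"        # placeholder; human-readable, used in logs
  lora_int_id: 1                 # unique per simultaneously loaded adapter
  lora_path: "/path/to/adapter"  # placeholder path to the adapter weights
```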
Omit the entire block if you evaluate the base model. When enabled, the adapter is transparently applied to all methods.

Note: If you enable LoRA, there is a known prompt-logprob bug (https://discuss.vllm.ai/t/bug-wrong-lora-mapping-during-prompt-logprobs-computing/500/2). Setting `sampling_parameters.prompt_logprobs` currently raises an error, so methods like `loss` or `mink` cannot run.
### data Block

| Field | Required | Default | Description |
|---|---|---|---|
| `data_path` | ✅ | – | Path to a local file or one of the dedicated Hugging Face datasets (`swj0419/WikiMIA` or `iamgroot42/mimir_{domain}_{ngram}`). |
| `format` | ❌ | `csv` | One of `csv`, `jsonl`, `json`, `parquet`. Use `huggingface` only in combination with `swj0419/WikiMIA` or `iamgroot42/mimir_{domain}_{ngram}`. |
| `text_column` | ❌ | `text` | Column containing the raw text to probe. |
| `label_column` | ❌ | `label` | Column containing membership labels (`1` = member, `0` = non-member). |
| `text_length` | ❌ | `32` | Number of words kept from each sample. WikiMIA requires one of 32, 64, 128, or 256; MIMIR requires 200. |
| `space_delimited_language` | ❌ | `true` | Set to `false` for languages written without whitespace, such as Japanese. |

Note: If `space_delimited_language` is `false`, you must pre-process your text and insert spaces between words beforehand; Fast-MIA assumes the input is already space-separated and strips the spaces back out during scoring.
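A sketch of a `data` block for a local JSONL file with non-default column names; the path and column names are placeholders:

```yaml
data:
  data_path: "data/my_dataset.jsonl"  # placeholder local file
  format: "jsonl"
  text_column: "document"             # placeholder column name
  label_column: "is_member"           # placeholder column name
  text_length: 32
```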
#### Supported File Formats

| Format | Reader |
|---|---|
| `csv` | `pandas.read_csv(data_path)` |
| `jsonl` | `pandas.read_json(data_path, lines=True)` |
| `json` | `pandas.read_json(data_path)` |
| `parquet` | `pandas.read_parquet(data_path)` |
| `huggingface` | Not available for arbitrary datasets. Use only with the dedicated WikiMIA/MIMIR loaders described below. |
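For reference, a minimal local CSV that works with the default `text_column`/`label_column` names could look like this (the rows are illustrative placeholders):

```csv
text,label
"First sample believed to be in the training data",1
"Second sample believed to be unseen",0
```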
#### Supported Hugging Face Datasets

Currently, only the following datasets are supported via the `huggingface` format:

- The WikiMIA dataset is handled specially. If you set `data_path` to `"swj0419/WikiMIA"`, it will be automatically recognized, and the data corresponding to the specified text length (32, 64, 128, or 256) will be automatically loaded (e.g., `"WikiMIA_length64"`).
- The MIMIR dataset is handled specially too. If you set `data_path` to `"iamgroot42/mimir_{domain}_{ngram}"`, it will be automatically recognized, and the data corresponding to the specified domain and n-gram will be automatically loaded (e.g., `"iamgroot42/mimir"` with `"pile_cc"` and `"ngram_7_0.2"`); see the example after this list.
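For example, a `data` block selecting the MIMIR `pile_cc` domain with the `ngram_7_0.2` split (both values taken from the examples above) might look like:

```yaml
data:
  data_path: "iamgroot42/mimir_pile_cc_ngram_7_0.2"
  format: "huggingface"
  text_length: 200   # MIMIR requires 200
```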
Note: To use the MIMIR dataset, you need to create a `.env` file in the project root directory containing your Hugging Face token:

```
HUGGINGFACE_TOKEN=your_huggingface_token_here
```

Replace `your_huggingface_token_here` with your actual Hugging Face token. You can obtain a token from Hugging Face Settings.
### methods Block

Declare any number of methods. Each entry looks like:

```yaml
- type: "method_name"
  params:
    # method-specific keys
```
Available method types and their parameters:

| Method Name | Description | Parameters |
|---|---|---|
| `loss` | Uses the model's loss | – |
| `zlib` | Uses the ratio of information content calculated by zlib compression | – |
| `ref` | Uses the difference in loss between the target model and a reference model | `reference_model` (required; model configuration for the reference model, same fields as the `model` block). |
| `lower` | Uses the ratio of loss after lowercasing the text | – |
| `mink` | https://github.com/swj0419/detect-pretrain-code | `ratio` (0.0–1.0, default 0.5). |
| `pac` | https://github.com/yyy01/PAC | `alpha` (augmentation strength, default 0.3), `N` (augmentations per sample, default 5). |
| `recall` | https://github.com/ruoyuxie/recall | `num_shots` (number of prefix texts, default 10), `pass_window` (skip max-length trimming, default False). |
| `conrecall` | https://github.com/WangCheng0116/CON-RECALL | Same as `recall`, plus `gamma` (ratio of member-prefix loss, default 0.5). |
| `samia` | https://github.com/nlp-titech/samia | `num_samples` (number of samples, default 5), `prefix_ratio` (ratio of prefix, default 0.5), `zlib` (use zlib, default True). |
| `dcpdd` | https://github.com/zhang-wei-chao/DC-PDD | `file_num` (number of C4 text files, default 15), `max_token_length` (max token length, default 1024), `alpha` (upper bound of score, default 0.01). |
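A sketch of a `methods` list combining parameter-free and parameterized entries; the reference model ID is a placeholder:

```yaml
methods:
  - type: "loss"
  - type: "mink"
    params:
      ratio: 0.2
  - type: "ref"
    params:
      reference_model:
        model_id: "EleutherAI/pythia-160m"  # placeholder reference model
```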
### output_dir

Directory where CSV results are written. The folder is created if missing.
## Output Files

### Default Output

By default, Fast-MIA saves the following files in a timestamped folder:

```
results/
└── YYYYMMDD-HHMMSS/
    ├── config.yaml   # Copy of the configuration used
    ├── results.csv   # Summary metrics (AUROC, FPR@95, TPR@5)
    └── report.txt    # Human-readable summary report
```
### Detailed Report Mode

When you run with `--detailed-report`, Fast-MIA generates additional outputs for benchmarking and analysis:

```bash
uv run --with 'vllm==0.15.1' python main.py --config config/sample.yaml --detailed-report
```
#### Output Structure

```
results/
└── YYYYMMDD-HHMMSS/                  # Timestamped folder for each run
    ├── config.yaml                   # Copy of the configuration used
    ├── results.csv                   # Summary metrics (AUROC, FPR@95, TPR@5)
    ├── detailed_scores.csv           # Per-sample scores for each method
    ├── metadata.json                 # Execution metadata (JSON format)
    ├── metadata.yaml                 # Execution metadata (YAML format)
    ├── report.txt                    # Human-readable summary report
    └── figures/
        ├── roc_curves.png            # ROC curves for all methods
        ├── score_distributions.png   # Score histograms (member vs non-member)
        └── metrics_comparison.png    # Bar chart comparing metrics
```
#### Metadata Contents

The metadata files (`metadata.json` / `metadata.yaml`) include:

| Section | Contents |
|---|---|
| `experiment` | Start/end time, duration |
| `environment` | Python version, platform, hostname |
| `git` | Commit hash, branch, dirty status |
| `model` | Model ID, parameters |
| `data` | Dataset path, format, sample counts |
| `sampling_parameters` | vLLM sampling configuration |
| `methods` | List of evaluated methods with parameters |
| `cache` | Cache hit/miss statistics |
#### Detailed Scores

The `detailed_scores.csv` file contains per-sample scores:

```csv
label,loss,zlib,mink_0.2,recall
1,0.832,0.654,0.721,0.891
0,0.234,0.312,0.287,0.156
...
```

This allows for post-hoc analysis, custom visualizations, or statistical testing.
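As one example of such post-hoc analysis, the sketch below recomputes AUROC per method with pandas and scikit-learn; the results path is a placeholder, and it assumes higher scores indicate membership:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Placeholder path: substitute your own timestamped results folder.
df = pd.read_csv("results/YYYYMMDD-HHMMSS/detailed_scores.csv")

# Every column except `label` holds one method's per-sample scores.
for method in df.columns.drop("label"):
    auroc = roc_auc_score(df["label"], df[method])
    print(f"{method}: AUROC = {auroc:.3f}")
```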