Bio Research Nextflow Development

Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data—either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.

by @anthropics · Apache 2.0 · 10.9k

Built for: Researchers

What this skill does

Automate professional-grade analysis of genetic sequencing data to produce clear results like gene expression levels or DNA variant calls without manual coding. You can process your own experimental files or automatically fetch and analyze public datasets from online repositories using validated research workflows. Reach for this tool when you need to transform raw data into biological insights quickly and reliably.

Anthropic · Research

name: nextflow-development
description: Run nf-core bioinformatics pipelines (rnaseq, sarek, atacseq) on sequencing data. Use when analyzing RNA-seq, WGS/WES, or ATAC-seq data, either local FASTQs or public datasets from GEO/SRA. Triggers on nf-core, Nextflow, FASTQ analysis, variant calling, gene expression, differential expression, GEO reanalysis, GSE/GSM/SRR accessions, or samplesheet creation.

nf-core Pipeline Deployment

Run nf-core bioinformatics pipelines on local or public sequencing data.

Target users: Bench scientists and researchers without specialized bioinformatics training who need to run large-scale omics analyses—differential expression, variant calling, or chromatin accessibility analysis.

Workflow Checklist

- [ ] Step 0: Acquire data (if from GEO/SRA)
- [ ] Step 1: Environment check (MUST pass)
- [ ] Step 2: Select pipeline (confirm with user)
- [ ] Step 3: Run test profile (MUST pass)
- [ ] Step 4: Create samplesheet
- [ ] Step 5: Configure & run (confirm genome with user)
- [ ] Step 6: Verify outputs

Step 0: Acquire Data (GEO/SRA Only)

Skip this step if user has local FASTQ files.

For public datasets, fetch from GEO/SRA first. See references/geo-sra-acquisition.md for the full workflow.

Quick start:

# 1. Get study info
python scripts/sra_geo_fetch.py info GSE110004

# 2. Download (interactive mode)
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i

# 3. Generate samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 --fastq-dir ./fastq -o samplesheet.csv

DECISION POINT: After fetching study info, confirm with user:

- Which sample subset to download (if multiple data types)
- Suggested genome and pipeline

Then continue to Step 1.

Step 1: Environment Check

Run this first. The pipeline will fail unless the environment passes every critical check.

python scripts/check_environment.py

All critical checks must pass. If any fail, provide fix instructions:

Docker issues

Problem	Fix
Not installed	Install from https://docs.docker.com/get-docker/
Permission denied	`sudo usermod -aG docker $USER` then re-login
Daemon not running	`sudo systemctl start docker`

Nextflow issues

Problem	Fix
Not installed	`curl -s https://get.nextflow.io | bash && mkdir -p ~/bin && mv nextflow ~/bin/`
Version < 23.04	`nextflow self-update`

Java issues

Problem	Fix
Not installed / < 11	`sudo apt install openjdk-11-jdk`

Do not proceed until all checks pass. For HPC/Singularity, see references/troubleshooting.md.
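
The checks in the tables above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/check_environment.py: the function names `parse_version`, `meets_minimum`, and `missing_tools` are hypothetical, and the minimum versions are taken from the tables above (Nextflow 23.04, Java 11).

```python
import re
import shutil

# Minimum versions taken from the tables above.
MIN_NEXTFLOW = (23, 4)
MIN_JAVA = (11,)


def parse_version(text):
    """Return the first version number found in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+|\d+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(version_text, minimum):
    """True if the version embedded in version_text is at least minimum."""
    return parse_version(version_text) >= minimum


def missing_tools(tools=("docker", "nextflow", "java")):
    """List the required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

In practice the version strings would come from `nextflow -version` and `java -version`; legacy Java 8 reports itself as `1.8.0_x`, which this comparison correctly treats as older than 11.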

Step 2: Select Pipeline

DECISION POINT: Confirm with user before proceeding.

Data Type	Pipeline	Version	Goal
RNA-seq	`rnaseq`	3.22.2	Gene expression
WGS/WES	`sarek`	3.7.1	Variant calling
ATAC-seq	`atacseq`	2.1.2	Chromatin accessibility

Auto-detect from data:

python scripts/detect_data_type.py /path/to/data

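For pipeline-specific details, see references/pipelines/rnaseq.md, references/pipelines/sarek.md, and references/pipelines/atacseq.md.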

Step 3: Run Test Profile

Validates the environment on a small bundled test dataset. This MUST pass before you run real data.

nextflow run nf-core/<pipeline> -r <version> -profile test,docker --outdir test_output

Pipeline	Command
rnaseq	`nextflow run nf-core/rnaseq -r 3.22.2 -profile test,docker --outdir test_rnaseq`
sarek	`nextflow run nf-core/sarek -r 3.7.1 -profile test,docker --outdir test_sarek`
atacseq	`nextflow run nf-core/atacseq -r 2.1.2 -profile test,docker --outdir test_atacseq`

Verify:

ls test_output/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log

If test fails, see references/troubleshooting.md.

Step 4: Create Samplesheet

Generate automatically

python scripts/generate_samplesheet.py /path/to/data <pipeline> -o samplesheet.csv

The script:

- Discovers FASTQ/BAM/CRAM files
- Pairs R1/R2 reads
- Infers sample metadata
- Validates before writing

For sarek: Script prompts for tumor/normal status if not auto-detected.
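
The R1/R2 pairing step can be illustrated as follows. This is a sketch under assumptions, not the logic of scripts/generate_samplesheet.py: the `READ_TAG` pattern and the `pair_reads` function are hypothetical, and the pattern only covers common naming schemes such as `SAMPLE_R1.fq.gz`, `SAMPLE_1.fastq.gz`, and Illumina-style `SAMPLE_S1_L001_R1_001.fastq.gz`.

```python
import re
from collections import defaultdict

# Sample name, optional Illumina _S<n> and _L<nnn> tags, the read number
# (with or without an "R"), an optional _001 suffix, then the extension.
READ_TAG = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?(?:_L\d{3})?_R?(?P<read>[12])(?:_001)?"
    r"\.f(?:ast)?q(?:\.gz)?$"
)


def pair_reads(filenames):
    """Group FASTQ file names by sample, keyed as fastq_1 / fastq_2."""
    pairs = defaultdict(dict)
    for name in sorted(filenames):
        match = READ_TAG.match(name)
        if match is None:
            continue  # not a recognizable FASTQ name; skip it
        pairs[match["sample"]][f"fastq_{match['read']}"] = name
    return dict(pairs)
```

A sample that ends up with only `fastq_1` is single-end (or is missing its mate), which is worth flagging before the samplesheet is written.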

Validate existing samplesheet

python scripts/generate_samplesheet.py --validate samplesheet.csv <pipeline>

Samplesheet formats

rnaseq:

sample,fastq_1,fastq_2,strandedness
SAMPLE1,/abs/path/R1.fq.gz,/abs/path/R2.fq.gz,auto

sarek:

patient,sample,lane,fastq_1,fastq_2,status
patient1,tumor,L001,/abs/path/tumor_R1.fq.gz,/abs/path/tumor_R2.fq.gz,1
patient1,normal,L001,/abs/path/normal_R1.fq.gz,/abs/path/normal_R2.fq.gz,0

atacseq:

sample,fastq_1,fastq_2,replicate
CONTROL,/abs/path/ctrl_R1.fq.gz,/abs/path/ctrl_R2.fq.gz,1
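
A samplesheet can be sanity-checked against the three headers above before launching a run. The sketch below is illustrative only: `REQUIRED` and `samplesheet_problems` are hypothetical names, the required columns are copied from the formats shown above, and the absolute-path rule mirrors the `/abs/path/...` convention used in the examples.

```python
import csv
import io

# Column sets copied from the samplesheet formats shown above.
REQUIRED = {
    "rnaseq": ["sample", "fastq_1", "fastq_2", "strandedness"],
    "sarek": ["patient", "sample", "lane", "fastq_1", "fastq_2", "status"],
    "atacseq": ["sample", "fastq_1", "fastq_2", "replicate"],
}


def samplesheet_problems(text, pipeline):
    """Return a list of human-readable problems; an empty list means OK."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in REQUIRED[pipeline] if col not in header]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for col in ("fastq_1", "fastq_2"):
            path = row[col]
            if path and not path.startswith("/"):
                problems.append(f"line {line_no}: {col} is not an absolute path")
    return problems
```

An empty `fastq_2` is accepted here so that single-end rows pass; tighten that check if your data must be paired.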

Step 5: Configure & Run

5a. Check genome availability

python scripts/manage_genomes.py check <genome>
# If not installed:
python scripts/manage_genomes.py download <genome>

Common genomes: GRCh38 (human), GRCh37 (legacy), GRCm39 (mouse), R64-1-1 (yeast), BDGP6 (fly)

5b. Decision points

DECISION POINT: Confirm with user:

- Genome: which reference to use
- Pipeline-specific options:
  - rnaseq: aligner (star_salmon recommended, hisat2 for low memory)
  - sarek: tools (haplotypecaller for germline, mutect2 for somatic)
  - atacseq: read_length (50, 75, 100, or 150)

5c. Run pipeline

nextflow run nf-core/<pipeline> \
    -r <version> \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --genome <genome> \
    -resume

Key flags:

- -r: Pin version
- -profile docker: Use Docker (or singularity for HPC)
- --genome: iGenomes key
- -resume: Continue from checkpoint

Resource limits (if needed):

--max_cpus 8 --max_memory '32.GB' --max_time '24.h'
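
When the run is launched from a script, it can help to assemble the command as an argument list rather than a shell string. The helper below is a hypothetical sketch (`build_nextflow_command` is not part of the skill's scripts); it only uses the flags shown in the command above.

```python
import shlex


def build_nextflow_command(pipeline, version, genome,
                           samplesheet="samplesheet.csv",
                           outdir="results", profile="docker", resume=True):
    """Assemble the nextflow invocation shown above as an argument list."""
    cmd = [
        "nextflow", "run", f"nf-core/{pipeline}",
        "-r", version,            # pin the pipeline version
        "-profile", profile,      # docker, or singularity on HPC
        "--input", samplesheet,
        "--outdir", outdir,
        "--genome", genome,       # iGenomes key
    ]
    if resume:
        cmd.append("-resume")     # continue from cached work if present
    return cmd


# Print a copy-pasteable command line for review before running it.
print(shlex.join(build_nextflow_command("rnaseq", "3.22.2", "GRCh38")))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.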

Step 6: Verify Outputs

Check completion

ls results/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
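
The same two checks can be done programmatically. This is a minimal sketch with hypothetical helper names; it assumes the report location and the success message used in the two commands above.

```python
from pathlib import Path

# The message that the grep command above looks for.
SUCCESS_MARKER = "Pipeline completed successfully"


def report_path(outdir="results"):
    """Location of the MultiQC report checked with ls above."""
    return Path(outdir) / "multiqc" / "multiqc_report.html"


def log_reports_success(log_text):
    """True if the Nextflow log text contains the success message."""
    return SUCCESS_MARKER in log_text


def run_succeeded(outdir="results", log_path=".nextflow.log"):
    """Both checks combined: report exists and the log reports success."""
    log = Path(log_path)
    if not report_path(outdir).is_file() or not log.is_file():
        return False
    return log_reports_success(log.read_text(errors="replace"))
```

Note that `.nextflow.log` is overwritten by each new `nextflow` invocation in the same directory, so the check should run before any other Nextflow command.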

Key outputs by pipeline

rnaseq:

results/star_salmon/salmon.merged.gene_counts.tsv - Gene counts
results/star_salmon/salmon.merged.gene_tpm.tsv - TPM values

sarek:

results/variant_calling/*/ - VCF files
results/preprocessing/recalibrated/ - Recalibrated alignments (CRAM by default; BAM with --save_output_as_bam)

atacseq:

results/bwa/merged_library/macs2/narrow_peak/ - Peak calls (with --narrow_peak; the default broad-peak mode writes to broad_peak/)
results/bwa/merged_library/bigwig/ - Coverage tracks

Quick Reference

For common exit codes and fixes, see references/troubleshooting.md.

Resume failed run

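# Re-run the original command unchanged, with -resume added
nextflow run nf-core/<pipeline> -r <version> -profile docker --input samplesheet.csv --outdir results --genome <genome> -resume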

References

references/geo-sra-acquisition.md - Downloading public GEO/SRA data
references/troubleshooting.md - Common issues and fixes
references/installation.md - Environment setup
references/pipelines/rnaseq.md - RNA-seq pipeline details
references/pipelines/sarek.md - Variant calling details
references/pipelines/atacseq.md - ATAC-seq details

Disclaimer

This skill is provided as a prototype example demonstrating how to integrate nf-core bioinformatics pipelines into Claude Code for automated analysis workflows. The current implementation supports three pipelines (rnaseq, sarek, and atacseq), serving as a foundation that enables the community to expand support to the full set of nf-core pipelines.

It is intended for educational and research purposes and should not be considered production-ready without appropriate validation for your specific use case. Users are responsible for ensuring their computing environment meets pipeline requirements and for verifying analysis results.

Anthropic does not guarantee the accuracy of bioinformatics outputs, and users should follow standard practices for validating computational analyses. This integration is not officially endorsed by or affiliated with the nf-core community.

Attribution

When publishing results, cite the appropriate pipeline. Citations are available in each nf-core repository’s CITATIONS.md file (e.g., https://github.com/nf-core/rnaseq/blob/3.22.2/CITATIONS.md).

Licenses

nf-core pipelines: MIT License (https://nf-co.re/about)
Nextflow: Apache License, Version 2.0 (https://www.nextflow.io/about-us.html)
NCBI SRA Toolkit: Public Domain (https://github.com/ncbi/sra-tools/blob/master/LICENSE)

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

1. Definitions.

"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.

"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.

"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.

"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.

"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.

"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.

"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).

"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.

"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."

"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.

2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.

3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.

4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:

(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and

(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and

(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and

(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.

You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.

5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.

6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.

7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.

8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.

9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

GEO/SRA Data Acquisition

Download raw sequencing data from NCBI GEO/SRA and prepare it for nf-core pipelines.

Use this when: Reanalyzing published datasets, validating findings, or comparing results against public cohorts.

Workflow Overview
Step 1: Fetch Study Information
Step 2: Review Sample Groups
Step 3: Download FASTQ Files
Step 4: Generate Samplesheet
Step 5: Run nf-core Pipeline
Supported Pipelines
Supported Organisms
Complete Example
Troubleshooting

Workflow Overview

Example: "Find differentially expressed genes in GSE309891 (drug-treated vs control)"

┌─────────────────────────────────────────────────────────────────┐
│                    GEO/SRA DATA ACQUISITION                     │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │   Fetch study info     │
                 │   • Query NCBI/SRA     │
                 │   • Get metadata       │
                 │   • Detect organism    │
                 │   • Identify data type │
                 └────────────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │   Present summary      │
                 │   • Organism: Human    │
                 │   • Genome: GRCh38     │
                 │   • Type: RNA-Seq      │
                 │   • Pipeline: rnaseq   │
                 │   • Samples: 12        │
                 │     (6 treated,        │
                 │      6 control)        │
                 │   • Size: ~24 GB       │
                 └────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  USER CONFIRMS  │◄──── Decision point
                    │  genome/pipeline│
                    └─────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │   Select samples       │
                 │   • Group by condition │
                 │   • Show treated/ctrl  │
                 └────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  USER SELECTS   │◄──── Decision point
                    │  sample subset  │
                    └─────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │   Download FASTQs      │
                 │   • 24 files (R1+R2)   │
                 │   • Parallel transfers │
                 │   • Auto-resume        │
                 └────────────────────────┘
                              │
                              ▼
                 ┌────────────────────────┐
                 │   Generate samplesheet │
                 │   • Map SRR to files   │
                 │   • Pair R1/R2         │
                 │   • Assign conditions  │
                 └────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                    NF-CORE PIPELINE EXECUTION                   │
│              (Continue with Step 1 of main workflow)            │
└─────────────────────────────────────────────────────────────────┘

Instructions for Claude

When assisting users with GEO/SRA data acquisition:

- Always fetch study info first to show the user what data is available.
- Ask for confirmation before downloading: present the sample groups and sizes, then ask which subset to download using AskUserQuestion.
- Suggest an appropriate genome and pipeline based on the organism and data type.
- Return to the main SKILL.md workflow after data preparation is complete.

Example confirmation question:

Question: "Which sample group would you like to download?"
Options:
  - "RNA-Seq:PAIRED (42 samples, ~87 GB)"
  - "RNA-Seq:SINGLE (7 samples, ~4.5 GB)"
  - "All samples (49 samples, ~92 GB)"

Step 1: Fetch Study Information

Get metadata about a GEO study before downloading.

python scripts/sra_geo_fetch.py info <GEO_ID>

Example:

python scripts/sra_geo_fetch.py info GSE110004

Output includes:

Study title and summary
Organism (with auto-suggested genome)
Number of samples and runs
Data types (RNA-Seq, ATAC-seq, etc.)
Estimated download size
Suggested nf-core pipeline

Save info to JSON:

python scripts/sra_geo_fetch.py info GSE110004 -o study_info.json

Step 2: Review Sample Groups

View sample groups organized by data type and layout. This is useful for studies with mixed data types.

python scripts/sra_geo_fetch.py groups <GEO_ID>

Example output:

Sample Group          Count Layout     GSM Range                    Est. Size
--------------------------------------------------------------------------------
RNA-Seq                  42 PAIRED     GSM2879618...(42 samples)      87.4 GB
RNA-Seq                   7 SINGLE     GSM2976181-GSM2976187           4.5 GB
--------------------------------------------------------------------------------
TOTAL                    49                                           91.9 GB

Available groups for --subset option:
  1. "RNA-Seq:PAIRED" - 42 samples (~87.4 GB)
  2. "RNA-Seq:SINGLE" - 7 samples (~4.5 GB)
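
The grouping shown in this output can be sketched as below. The record fields (`strategy`, `layout`, `gsm`, `bytes`) and the `group_runs` function are hypothetical; the actual script's data model is not shown in this document. The group key uses the same `"Strategy:LAYOUT"` form as the --subset option.

```python
from collections import defaultdict


def group_runs(runs):
    """Summarize runs per "strategy:layout" group.

    runs: iterable of dicts with the (assumed) keys
    'strategy', 'layout', 'gsm', and 'bytes'.
    """
    groups = defaultdict(lambda: {"samples": set(), "bytes": 0})
    for run in runs:
        key = f"{run['strategy']}:{run['layout']}"
        groups[key]["samples"].add(run["gsm"])   # one sample may have several runs
        groups[key]["bytes"] += run["bytes"]
    return {
        key: {"count": len(group["samples"]), "size_gb": round(group["bytes"] / 1e9, 1)}
        for key, group in groups.items()
    }
```

Counting distinct GSM accessions rather than runs keeps the sample count correct when a sample was sequenced across multiple runs.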

List individual runs:

python scripts/sra_geo_fetch.py list <GEO_ID>

# Filter by data type
python scripts/sra_geo_fetch.py list GSE110004 --filter "RNA-Seq:PAIRED"

DECISION POINT: Review the sample groups. Decide which subset to download if the study has multiple data types.

Step 3: Download FASTQ Files

Download FASTQ files from ENA (faster than SRA).

python scripts/sra_geo_fetch.py download <GEO_ID> -o <OUTPUT_DIR>

Options:

-o, --output: Output directory (required)
-i, --interactive: Interactively select sample group to download
-s, --subset: Filter by data type (e.g., "RNA-Seq:PAIRED")
-p, --parallel: Parallel downloads (default: 4)
-t, --timeout: Download timeout in seconds (default: 600)
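
For orientation, FASTQ locations for a run can be looked up through the ENA Portal API file report. The sketch below is an assumption about one way to do this, not a description of what sra_geo_fetch.py does internally; `filereport_url` and `parse_filereport` are hypothetical helpers, and no network request is made here.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"
FIELDS = "run_accession,fastq_ftp,fastq_md5,fastq_bytes"


def filereport_url(accession):
    """Build the file-report query URL for a run or study accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": FIELDS,
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Turn the TSV response into one record per FASTQ file."""
    files = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Paired-end runs list both files in one cell, separated by ';'.
        urls = [url for url in row["fastq_ftp"].split(";") if url]
        md5s = row["fastq_md5"].split(";")
        sizes = row["fastq_bytes"].split(";")
        for url, md5, size in zip(urls, md5s, sizes):
            files.append({
                "run": row["run_accession"],
                "url": f"ftp://{url}",   # the field omits the scheme
                "md5": md5,
                "bytes": int(size),
            })
    return files
```

For example, `filereport_url("SRR6357070")` gives a URL whose response can be fetched and passed to `parse_filereport`; the MD5 and byte counts allow each download to be verified.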

Interactive Mode (Recommended)

Use -i flag for interactive sample selection when the study has multiple data types:

python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i

Interactive output:

============================================================
  SELECT SAMPLE GROUP TO DOWNLOAD
============================================================

  [1] RNA-Seq (paired)
      Samples: 42
      GSM: GSM2879618...(42 samples)
      Size: ~87.4 GB

  [2] RNA-Seq (single)
      Samples: 7
      GSM: GSM2976181-GSM2976187
      Size: ~4.5 GB

  [0] Download ALL (49 samples)
------------------------------------------------------------

Enter selection (0-2):

Direct Subset Selection

Alternatively, specify the subset directly:

# Download only RNA-Seq paired-end data
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq \
    --subset "RNA-Seq:PAIRED" --parallel 6

Note: Downloads automatically skip existing files. Resume interrupted downloads by re-running the command.

Step 4: Generate Samplesheet

Create a samplesheet compatible with nf-core pipelines.

python scripts/sra_geo_fetch.py samplesheet <GEO_ID> \
    --fastq-dir <FASTQ_DIR> \
    -o samplesheet.csv

Options:

-f, --fastq-dir: Directory containing downloaded FASTQ files (required)
-o, --output: Output samplesheet path (default: samplesheet.csv)
-p, --pipeline: Target pipeline (auto-detected if not specified)

Example:

python scripts/sra_geo_fetch.py samplesheet GSE110004 \
    --fastq-dir ./fastq \
    -o samplesheet.csv

Output: The script will:

Create samplesheet in the format required by the target pipeline
Display suggested genome reference
Show suggested nf-core command

Step 5: Run nf-core Pipeline

After generating the samplesheet, the script provides a suggested command.

Example output:

Suggested command:
   nextflow run nf-core/rnaseq \
       --input samplesheet.csv \
       --outdir results \
       --genome R64-1-1 \
       -profile docker

DECISION POINT: Review and confirm:

Is the suggested pipeline correct?
Is the genome reference correct for your organism?
Do you need additional pipeline options?

Then return to the main SKILL.md workflow (Step 1: Environment Check) to proceed with pipeline execution.

Supported Pipelines

The skill auto-detects appropriate pipelines based on library strategy. Pipelines marked with ★ are fully supported with configs, samplesheet generation, and documentation. Others are suggested but require manual setup following nf-core documentation.

Library Strategy	Suggested Pipeline	Support
RNA-Seq	nf-core/rnaseq	★ Full
ATAC-seq	nf-core/atacseq	★ Full
WGS/WXS	nf-core/sarek	★ Full
ChIP-seq	nf-core/chipseq	Manual
Bisulfite-Seq	nf-core/methylseq	Manual
miRNA-Seq	nf-core/smrnaseq	Manual
Amplicon	nf-core/ampliseq	Manual

Supported Organisms

Common organisms with auto-suggested genomes:

Organism	Genome	Notes
Homo sapiens	GRCh38	Human reference
Mus musculus	GRCm39	Mouse reference
Saccharomyces cerevisiae	R64-1-1	Yeast S288C
Drosophila melanogaster	BDGP6	Fruit fly
Caenorhabditis elegans	WBcel235	C. elegans
Danio rerio	GRCz11	Zebrafish
Arabidopsis thaliana	TAIR10	Arabidopsis
Rattus norvegicus	Rnor_6.0	Rat

See scripts/config/genomes.yaml for the full list.
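
The two lookup tables above amount to a pair of dictionaries. The sketch below copies their contents; the `suggest` function and the dictionary names are hypothetical, and the real mapping lives in scripts/config/genomes.yaml, which may contain more entries.

```python
# Organism -> iGenomes key, copied from the Supported Organisms table.
GENOMES = {
    "Homo sapiens": "GRCh38",
    "Mus musculus": "GRCm39",
    "Saccharomyces cerevisiae": "R64-1-1",
    "Drosophila melanogaster": "BDGP6",
    "Caenorhabditis elegans": "WBcel235",
    "Danio rerio": "GRCz11",
    "Arabidopsis thaliana": "TAIR10",
    "Rattus norvegicus": "Rnor_6.0",
}

# Library strategy -> (pipeline, fully supported?), from Supported Pipelines.
PIPELINES = {
    "RNA-Seq": ("rnaseq", True),
    "ATAC-seq": ("atacseq", True),
    "WGS": ("sarek", True),
    "WXS": ("sarek", True),
    "ChIP-seq": ("chipseq", False),
    "Bisulfite-Seq": ("methylseq", False),
    "miRNA-Seq": ("smrnaseq", False),
    "Amplicon": ("ampliseq", False),
}


def suggest(organism, library_strategy):
    """Suggest a genome and pipeline; None means no automatic suggestion."""
    pipeline, full = PIPELINES.get(library_strategy, (None, False))
    return {
        "genome": GENOMES.get(organism),
        "pipeline": pipeline,
        "fully_supported": full,
    }
```

A `None` genome or pipeline is the cue to ask the user rather than guess, in line with the decision points in this workflow.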

Complete Example

Reanalyze GSE110004 (yeast RNA-seq):

# 1. Get study info and sample groups
python scripts/sra_geo_fetch.py info GSE110004

# 2. Download with interactive selection
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i
# Select option [1] for RNA-Seq paired-end samples

# 3. Generate samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 \
    --fastq-dir ./fastq \
    -o samplesheet.csv

# 4. Run nf-core/rnaseq (continue with main SKILL.md workflow)
nextflow run nf-core/rnaseq \
    -r 3.22.2 \
    --input samplesheet.csv \
    --outdir results \
    --genome R64-1-1 \
    -profile docker

Alternative: Non-interactive Download

# Review sample groups first
python scripts/sra_geo_fetch.py groups GSE110004

# Download specific subset directly
python scripts/sra_geo_fetch.py download GSE110004 \
    --subset "RNA-Seq:PAIRED" \
    -o ./fastq \
    --parallel 4

Troubleshooting

ENA Download Fails

If ENA downloads fail, the data may need to be fetched directly from SRA:

# Create SRA tools environment
conda create -y -n sra_tools -c conda-forge -c bioconda sra-tools

# Download with prefetch + fasterq-dump
conda run -n sra_tools prefetch SRR6357070
conda run -n sra_tools fasterq-dump SRR6357070 -O ./fastq

No SRA Runs Found

Some GEO datasets only have processed data, not raw sequencing reads. Check:

python scripts/sra_geo_fetch.py info <GEO_ID>

If "Runs: 0", the dataset may not have raw data in SRA.

SuperSeries Support

GEO SuperSeries (which contain multiple SubSeries) are automatically handled. The tool will:

Detect that a GEO ID is a SuperSeries
Find the linked BioProject accession
Fetch all SRA runs from the BioProject

Example: GSE110004 is a SuperSeries that links to BioProject PRJNA432544.

Genome Not Recognized

If the organism is not in the genome mapping, manually specify the genome:

# Check available iGenomes
python scripts/manage_genomes.py list

# Or provide custom reference files to nf-core
nextflow run nf-core/rnaseq --fasta /path/to/genome.fa --gtf /path/to/genes.gtf

Requirements

Python 3.8+
requests library (optional but recommended)
pyyaml library (optional, for genome config)
Network access to NCBI and ENA

Install optional dependencies:

pip install requests pyyaml

Installation

Quick install
Docker setup
Singularity setup (HPC)
nf-core tools (optional)
Verify installation
Common issues

Quick install

# Nextflow
curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/bin && mv nextflow ~/bin/
export PATH="$HOME/bin:$PATH"

# Verify
nextflow -version
java -version  # Requires 11+

Docker setup

Linux

sudo apt-get update && sudo apt-get install docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in

macOS

Download Docker Desktop: https://docker.com/products/docker-desktop

Verify

docker run hello-world

Singularity setup (HPC)

# Ubuntu/Debian
sudo apt-get install singularity-container

# Or via conda
conda install -c conda-forge singularity

Configure cache

export NXF_SINGULARITY_CACHEDIR="$HOME/.singularity/cache"
mkdir -p "$NXF_SINGULARITY_CACHEDIR"
echo 'export NXF_SINGULARITY_CACHEDIR="$HOME/.singularity/cache"' >> ~/.bashrc

nf-core tools (optional)

pip install nf-core

Useful commands:

nf-core list                    # Available pipelines
nf-core launch rnaseq           # Interactive parameter selection
nf-core download rnaseq -r 3.22.2  # Download for offline use

Verify installation

nextflow run nf-core/demo -profile test,docker --outdir test_demo
ls test_demo/

Common issues

Java version wrong:

export JAVA_HOME=/path/to/java11

Docker permission denied:

sudo usermod -aG docker $USER
# Log out and back in

Nextflow not found:

echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

Troubleshooting

Quick fixes for common nf-core pipeline issues.

Exit Codes
HPC/Singularity Issues
Pipeline Failures
RNA-seq Specific
Sarek Specific
ATAC-seq Specific
Resource Management
Getting Help

Exit Codes

Common exit codes indicating resource issues (per nf-core docs):

Code	Cause	Fix
137	Out of memory	`--max_memory '32.GB'` or `'64.GB'` for WGS
143	Out of memory	`--max_memory '32.GB'` or `'64.GB'` for WGS
104, 134, 139, 247	Out of memory	Increase `--max_memory`
1	General error	Check `.nextflow.log` for details

Most pipelines auto-retry with 2x then 3x resources before failing.
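The table and the retry behaviour above can be sketched in Python. This is an illustration only: the helper names `diagnose_exit_code` and `memory_for_attempt` are hypothetical, and the linear scaling capped at a ceiling is an assumption drawn from the "2x then 3x" description, not the exact logic of any particular pipeline.

```python
# Exit codes from the table above that point to memory pressure.
OOM_EXIT_CODES = {104, 134, 137, 139, 143, 247}


def diagnose_exit_code(code: int) -> str:
    """Map a task exit code to the suggested fix from the table."""
    if code in OOM_EXIT_CODES:
        return "Out of memory: increase --max_memory"
    if code == 1:
        return "General error: check .nextflow.log for details"
    return "Unlisted exit code: check .nextflow.log for details"


def memory_for_attempt(base_gb: float, attempt: int, max_gb: float) -> float:
    """Memory requested on a given attempt (1-based), capped at max_gb.

    Assumes the request grows linearly with the attempt number
    (1x, 2x, 3x) and never exceeds the --max_memory ceiling.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_gb * attempt, max_gb)


for attempt in (1, 2, 3):
    print(attempt, memory_for_attempt(12, attempt, max_gb=32))
```

With a 12 GB base request and a 32 GB ceiling, the third attempt is capped at 32 GB rather than 36 GB. Under this model, retries alone cannot rescue a task that needs more than the ceiling, which is why raising `--max_memory` is the suggested fix.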

HPC/Singularity Issues

Singularity cache issues

export NXF_SINGULARITY_CACHEDIR="$HOME/.singularity/cache"
mkdir -p "$NXF_SINGULARITY_CACHEDIR"

Using Singularity instead of Docker

On HPC systems without Docker, use Singularity:

nextflow run nf-core/<pipeline> -profile singularity ...

Note: For basic environment setup (Docker, Nextflow, Java installation), see the inline instructions in Step 1 of SKILL.md.

Pipeline Failures

Container pull failed

Check network connectivity
Try: -profile singularity instead of docker
For offline: nf-core download <pipeline> -r <version>

"No such file" errors

Use absolute paths in samplesheet
Verify files exist: ls /path/to/file
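Both checks above can be automated for a whole samplesheet. The sketch below assumes a CSV with `fastq_1` and `fastq_2` columns (the names used by the samplesheet generator in this skill); the function name `find_path_problems` is hypothetical.

```python
import csv
import os


def find_path_problems(samplesheet_path, path_columns=("fastq_1", "fastq_2")):
    """Return a list of problems with file paths in a samplesheet CSV.

    Empty cells are skipped, since fastq_2 is blank for single-end samples.
    """
    problems = []
    with open(samplesheet_path, newline="") as handle:
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(handle), start=2):
            for column in path_columns:
                value = (row.get(column) or "").strip()
                if not value:
                    continue
                if not os.path.isabs(value):
                    problems.append(f"line {line_no}: {column} is not absolute: {value}")
                elif not os.path.isfile(value):
                    problems.append(f"line {line_no}: {column} not found: {value}")
    return problems
```

Calling `find_path_problems("samplesheet.csv")` returns an empty list when every listed file is an absolute path that exists on disk.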

Resume not working

# Check work directory exists
ls -la work/

# Force clean restart (loses cache)
rm -rf work/ .nextflow*
nextflow run nf-core/<pipeline> ...

RNA-seq Specific

STAR index fails

Increase memory: --max_memory '64.GB'
Or provide pre-built: --star_index /path/to/star/

Low alignment rate

Verify genome matches species
Check FastQC for adapter contamination
Try different aligner: --aligner hisat2

Strandedness detection fails

Specify explicitly: --strandedness reverse
Common values: forward, reverse, unstranded

Sarek Specific

BQSR fails

Check known sites for genome
Skip for non-standard references: --skip_bqsr

Mutect2 no variants

Verify tumor/normal pairing
Check samplesheet status column: 0=normal, 1=tumor
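For paired somatic calling, the pairing check can be scripted. This is a minimal sketch assuming a samplesheet with `patient` and `status` columns and the 0=normal, 1=tumor convention noted above; the function name `unpaired_patients` is hypothetical.

```python
import csv
from collections import defaultdict


def unpaired_patients(samplesheet_path):
    """Return {patient: missing_status} for patients lacking a tumor or normal row.

    Follows the status convention above: 0 = normal, 1 = tumor.
    """
    statuses = defaultdict(set)
    with open(samplesheet_path, newline="") as handle:
        for row in csv.DictReader(handle):
            statuses[row["patient"]].add((row.get("status") or "").strip())

    missing = {}
    for patient, seen in sorted(statuses.items()):
        if "0" not in seen:
            missing[patient] = "normal (status 0)"
        elif "1" not in seen:
            missing[patient] = "tumor (status 1)"
    return missing
```

An empty result means every patient has at least one normal and one tumor row. A patient listed with only tumor rows is reported as missing its normal sample.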

Out of memory for WGS

--max_memory '128.GB' --max_cpus 16

DeepVariant GPU issues

Ensure NVIDIA Docker runtime configured
Or use CPU mode (slower)

ATAC-seq Specific

Low FRiP score

Check library complexity in plotFingerprint/
May indicate over-transposition

Few peaks called

Lower threshold: --macs_qvalue 0.1
Use broad peaks: --narrow_peak false

High duplicates

Normal for low-input samples
Pipeline removes by default
Consider deeper sequencing

Resource Management

Set resource limits

--max_cpus 8 --max_memory '32.GB' --max_time '24.h'

Check available resources

# CPUs
nproc

# Memory
free -h

# Disk
df -h .
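The detected values can be turned into the resource flags shown under "Set resource limits". The sketch below is illustrative: `suggest_limits` is a hypothetical helper, and the `headroom` fraction of 0.8 is an arbitrary assumption, not an nf-core setting. Memory is passed in by hand because the standard library has no portable way to read total RAM.

```python
import os
import shutil


def suggest_limits(cpus: int, mem_gb: float, headroom: float = 0.8) -> str:
    """Build --max_cpus/--max_memory flags from detected resources.

    headroom is the fraction of total memory offered to the pipeline;
    0.8 is an illustrative default that leaves some room for the OS.
    """
    max_cpus = max(1, cpus)
    max_mem = max(1, int(mem_gb * headroom))
    return f"--max_cpus {max_cpus} --max_memory '{max_mem}.GB'"


cpus = os.cpu_count() or 1
free_disk_gb = shutil.disk_usage(".").free / 1024**3
print(f"CPUs: {cpus}, free disk: {free_disk_gb:.1f} GB")
# Replace 64 with the total memory reported by `free -h` on your machine.
print(suggest_limits(cpus, mem_gb=64, headroom=0.5))
```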

Getting Help

Check .nextflow.log for error details
Search nf-core Slack: https://nf-co.re/join
Open issue on GitHub: https://github.com/nf-core/<pipeline>/issues

#!/usr/bin/env python3
"""
Pre-flight environment validation for nf-core pipelines.

Checks Docker, Nextflow, Java, system resources, and network connectivity.
Run this BEFORE attempting any pipeline execution.

Usage:
    python check_environment.py
    python check_environment.py --json
"""

import json
import os
import shutil
import subprocess
import sys
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class CheckResult:
    """Result of a single environment check."""
    name: str
    passed: bool
    message: str
    details: Optional[str] = None
    fix: Optional[str] = None


@dataclass
class EnvironmentReport:
    """Complete environment validation report."""
    ready: bool
    checks: List[CheckResult] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def to_dict(self):
        return {
            "ready": self.ready,
            "checks": [asdict(c) for c in self.checks],
            "recommendations": self.recommendations
        }


def check_docker() -> CheckResult:
    """Check Docker availability, daemon status, and permissions."""
    if not shutil.which("docker"):
        return CheckResult(
            name="Docker",
            passed=False,
            message="Docker not found in PATH",
            fix="Install Docker: https://docs.docker.com/get-docker/"
        )

    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            text=True,
            timeout=15
        )

        if result.returncode != 0:
            stderr_lower = result.stderr.lower()
            if "permission denied" in stderr_lower:
                return CheckResult(
                    name="Docker",
                    passed=False,
                    message="Docker permission denied",
                    details="Cannot connect to Docker daemon",
                    fix="sudo usermod -aG docker $USER && newgrp docker"
                )
            elif "cannot connect" in stderr_lower or "is the docker daemon running" in stderr_lower:
                return CheckResult(
                    name="Docker",
                    passed=False,
                    message="Docker daemon not running",
                    details=result.stderr[:200] if result.stderr else None,
                    fix="sudo systemctl start docker"
                )
            else:
                return CheckResult(
                    name="Docker",
                    passed=False,
                    message="Docker error",
                    details=result.stderr[:200] if result.stderr else None,
                    fix="Check Docker installation and daemon status"
                )

        return CheckResult(
            name="Docker",
            passed=True,
            message="Docker is available and running"
        )

    except subprocess.TimeoutExpired:
        return CheckResult(
            name="Docker",
            passed=False,
            message="Docker command timed out",
            fix="Check Docker daemon status: sudo systemctl status docker"
        )
    except Exception as e:
        return CheckResult(
            name="Docker",
            passed=False,
            message=f"Docker check failed: {str(e)}"
        )


def check_nextflow() -> CheckResult:
    """Check Nextflow installation and version (requires >= 23.04)."""
    if not shutil.which("nextflow"):
        return CheckResult(
            name="Nextflow",
            passed=False,
            message="Nextflow not found in PATH",
            fix="curl -s https://get.nextflow.io | bash && mv nextflow ~/bin/ && export PATH=$HOME/bin:$PATH"
        )

    try:
        result = subprocess.run(
            ["nextflow", "-version"],
            capture_output=True,
            text=True,
            timeout=30
        )

        output = result.stdout + result.stderr
        version_line = output.strip().split('\n')[0] if output else ""

        import re
        match = re.search(r'(\d+)\.(\d+)\.(\d+)', version_line)

        if match:
            major, minor, patch = int(match.group(1)), int(match.group(2)), int(match.group(3))
            version_str = f"{major}.{minor}.{patch}"

            # Require version >= 23.04
            if major > 23 or (major == 23 and minor >= 4):
                return CheckResult(
                    name="Nextflow",
                    passed=True,
                    message=f"Nextflow {version_str} installed",
                    details=version_line
                )
            else:
                return CheckResult(
                    name="Nextflow",
                    passed=False,
                    message=f"Nextflow {version_str} is outdated (requires >= 23.04)",
                    details=version_line,
                    fix="nextflow self-update"
                )

        return CheckResult(
            name="Nextflow",
            passed=True,
            message="Nextflow installed (version unknown)",
            details=version_line
        )

    except subprocess.TimeoutExpired:
        return CheckResult(
            name="Nextflow",
            passed=False,
            message="Nextflow command timed out",
            fix="Check Nextflow installation"
        )
    except Exception as e:
        return CheckResult(
            name="Nextflow",
            passed=False,
            message=f"Nextflow check failed: {str(e)}"
        )


def check_java() -> CheckResult:
    """Check Java version (requires >= 11)."""
    if not shutil.which("java"):
        return CheckResult(
            name="Java",
            passed=False,
            message="Java not found in PATH",
            fix="Install Java 11+: sudo apt install openjdk-11-jdk"
        )

    try:
        result = subprocess.run(
            ["java", "-version"],
            capture_output=True,
            text=True,
            timeout=10
        )

        # Java version is typically in stderr
        output = result.stderr or result.stdout
        import re
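        match = re.search(r'version "(\d+)(?:\.(\d+))?', output)

        if match:
            version = int(match.group(1))
            # Java 8 and earlier report "1.x" (e.g. "1.8.0_292"); use x as the major version
            if version == 1 and match.group(2):
                version = int(match.group(2))
            version_line = output.strip().split('\n')[0]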

            if version >= 11:
                return CheckResult(
                    name="Java",
                    passed=True,
                    message=f"Java {version} installed",
                    details=version_line
                )
            else:
                return CheckResult(
                    name="Java",
                    passed=False,
                    message=f"Java {version} is too old (requires >= 11)",
                    details=version_line,
                    fix="Install Java 11+: sudo apt install openjdk-11-jdk"
                )

        return CheckResult(
            name="Java",
            passed=True,
            message="Java installed",
            details=output.strip().split('\n')[0] if output else None
        )

    except Exception as e:
        return CheckResult(
            name="Java",
            passed=False,
            message=f"Java check failed: {str(e)}"
        )


def check_resources() -> CheckResult:
    """Check system resources (CPU, memory, disk)."""
    try:
        # CPU cores
        cpu_count = os.cpu_count() or 1

        # Memory
        mem_gb = 0
        try:
            # Linux: read from /proc/meminfo
            with open('/proc/meminfo', 'r') as f:
                for line in f:
                    if line.startswith('MemTotal:'):
                        mem_kb = int(line.split()[1])
                        mem_gb = mem_kb / (1024 * 1024)
                        break
        except (FileNotFoundError, PermissionError):
            # macOS: use sysctl
            try:
                result = subprocess.run(
                    ['sysctl', '-n', 'hw.memsize'],
                    capture_output=True, text=True, timeout=5
                )
                if result.returncode == 0:
                    mem_gb = int(result.stdout.strip()) / (1024**3)
            except Exception:
                # sysctl missing or output unparsable; leave mem_gb at 0
                pass

        # Disk space (current directory)
        disk_gb = 0
        try:
            statvfs = os.statvfs('.')
            disk_gb = (statvfs.f_frsize * statvfs.f_bavail) / (1024**3)
        except Exception:
            # os.statvfs is unavailable on some platforms; leave disk_gb at 0
            pass

        details = f"CPUs: {cpu_count}, Memory: {mem_gb:.1f}GB, Disk: {disk_gb:.1f}GB available"

        # Check minimums
        warnings = []
        if cpu_count < 4:
            warnings.append(f"Low CPU count ({cpu_count}). Consider --max_cpus {cpu_count}")
        if 0 < mem_gb < 8:
            warnings.append(f"Low memory ({mem_gb:.1f}GB). Use --max_memory '{max(1, int(mem_gb))}.GB'")
        if 0 < disk_gb < 50:
            warnings.append(f"Low disk space ({disk_gb:.1f}GB). Pipelines need ~100GB for human data")

        if warnings:
            return CheckResult(
                name="Resources",
                passed=True,
                message="Resources available (with warnings)",
                details=details,
                fix="; ".join(warnings)
            )

        return CheckResult(
            name="Resources",
            passed=True,
            message="Sufficient resources available",
            details=details
        )

    except Exception as e:
        return CheckResult(
            name="Resources",
            passed=True,  # Don't fail on resource check errors
            message=f"Could not fully check resources: {str(e)}"
        )


def check_network() -> CheckResult:
    """Check network connectivity to Docker Hub and nf-core."""
    try:
        import urllib.request

        # User-Agent header to avoid 403 from sites that block default Python agent
        headers = {'User-Agent': 'nf-core-helper/1.0'}

        # Try Docker Hub
        try:
            req = urllib.request.Request("https://hub.docker.com", headers=headers)
            urllib.request.urlopen(req, timeout=10)
            docker_hub_ok = True
        except Exception:
            docker_hub_ok = False

        # Try nf-core (for pipeline downloads)
        try:
            req = urllib.request.Request("https://nf-co.re", headers=headers)
            urllib.request.urlopen(req, timeout=10)
            nfcore_ok = True
        except Exception:
            nfcore_ok = False

        if docker_hub_ok and nfcore_ok:
            return CheckResult(
                name="Network",
                passed=True,
                message="Network connectivity OK (Docker Hub & nf-core reachable)"
            )
        elif docker_hub_ok:
            return CheckResult(
                name="Network",
                passed=True,
                message="Docker Hub reachable (nf-co.re not reachable)",
                details="Pipeline downloads may still work via GitHub"
            )
        else:
            return CheckResult(
                name="Network",
                passed=False,
                message="Cannot reach Docker Hub",
                fix="Check network connection. Containers require Docker Hub access."
            )

    except Exception as e:
        return CheckResult(
            name="Network",
            passed=False,
            message=f"Network check failed: {str(e)}",
            fix="Check network connection and proxy settings"
        )


def run_all_checks() -> EnvironmentReport:
    """Run all environment checks and return comprehensive report."""
    checks = [
        check_docker(),
        check_nextflow(),
        check_java(),
        check_resources(),
        check_network(),
    ]

    # Critical checks that must pass
    critical_checks = ["Docker", "Nextflow", "Java"]
    ready = all(c.passed for c in checks if c.name in critical_checks)

    # Build recommendations
    recommendations = []
    for check in checks:
        if not check.passed and check.fix:
            recommendations.append(f"{check.name}: {check.fix}")
        elif check.passed and check.fix:  # Warnings
            recommendations.append(f"{check.name} (warning): {check.fix}")

    return EnvironmentReport(
        ready=ready,
        checks=checks,
        recommendations=recommendations
    )


def print_report(report: EnvironmentReport):
    """Print human-readable report to stdout."""
    print("\n" + "=" * 50)
    print("  nf-core Environment Check")
    print("=" * 50 + "\n")

    for check in report.checks:
        status = "\033[92m[PASS]\033[0m" if check.passed else "\033[91m[FAIL]\033[0m"
        print(f"{status} {check.name}: {check.message}")

        if check.details:
            print(f"       {check.details}")

        if not check.passed and check.fix:
            print(f"       \033[93mFix:\033[0m {check.fix}")
        elif check.passed and check.fix:  # Warning
            print(f"       \033[93mWarning:\033[0m {check.fix}")

    print()
    if report.ready:
        print("\033[92m✓ Environment is READY for nf-core pipelines.\033[0m")
    else:
        print("\033[91m✗ Environment is NOT READY. Please address the issues above.\033[0m")

    if report.recommendations:
        print("\n--- Recommendations ---")
        for i, rec in enumerate(report.recommendations, 1):
            print(f"  {i}. {rec}")

    print()


def main():
    import argparse

    parser = argparse.ArgumentParser(
        description="Check environment for nf-core pipeline execution",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    python check_environment.py           # Human-readable output
    python check_environment.py --json    # JSON output for parsing
        """
    )
    parser.add_argument("--json", action="store_true",
                        help="Output results as JSON")

    args = parser.parse_args()

    report = run_all_checks()

    if args.json:
        print(json.dumps(report.to_dict(), indent=2))
    else:
        print_report(report)

    sys.exit(0 if report.ready else 1)


if __name__ == "__main__":
    main()
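The `--json` output of `check_environment.py` can be consumed by other tooling. The following is a minimal sketch, assuming the JSON has the shape produced by `EnvironmentReport.to_dict()`; the sample payload and the `failed_checks` helper are hypothetical.

```python
import json

# Hypothetical sample of `check_environment.py --json` output, following the
# structure produced by EnvironmentReport.to_dict().
sample_output = """
{
  "ready": false,
  "checks": [
    {"name": "Docker", "passed": false, "message": "Docker daemon not running",
     "details": null, "fix": "sudo systemctl start docker"},
    {"name": "Java", "passed": true, "message": "Java 17 installed",
     "details": null, "fix": null}
  ],
  "recommendations": ["Docker: sudo systemctl start docker"]
}
"""


def failed_checks(report_json: str):
    """Return (name, fix) pairs for every check that did not pass."""
    report = json.loads(report_json)
    return [(c["name"], c.get("fix")) for c in report["checks"] if not c["passed"]]


for name, fix in failed_checks(sample_output):
    print(f"{name} failed; suggested fix: {fix}")
```

In practice the JSON would come from running the script, for example with `subprocess.run(..., capture_output=True, text=True)`. The script exits with status 1 when the environment is not ready, so a non-zero return code is expected in that case and is not itself an error in the caller.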

#!/usr/bin/env python3
"""
Auto-detect appropriate nf-core pipeline from data directory.

Analyzes filenames, directory structure, and file content hints to suggest
the most appropriate pipeline for the data.

Usage:
    python detect_data_type.py /path/to/data
    python detect_data_type.py /path/to/data --json
"""

import argparse
import json
import os
import sys
from pathlib import Path
from typing import Dict, List, Tuple

import yaml


def load_all_pipeline_configs() -> Dict[str, Dict]:
    """Load all pipeline configurations."""
    config_dir = Path(__file__).parent / "config" / "pipelines"
    configs = {}

    for config_file in config_dir.glob("*.yaml"):
        if config_file.stem.startswith("_"):
            continue
        with open(config_file) as f:
            configs[config_file.stem] = yaml.safe_load(f)

    return configs


def scan_directory(directory: str) -> Dict:
    """Scan directory and collect file information."""
    info = {
        'fastq_count': 0,
        'bam_count': 0,
        'cram_count': 0,
        'filenames': [],
        'directories': [],
        'total_size_gb': 0,
    }

    directory = os.path.abspath(directory)

    for root, dirs, files in os.walk(directory):
        # Collect directory names
        rel_root = os.path.relpath(root, directory)
        if rel_root != '.':
            info['directories'].append(rel_root.lower())

        for filename in files:
            filename_lower = filename.lower()

            # Count file types
            if any(filename_lower.endswith(ext) for ext in ['.fastq.gz', '.fq.gz', '.fastq', '.fq']):
                info['fastq_count'] += 1
            elif filename_lower.endswith('.bam'):
                info['bam_count'] += 1
            elif filename_lower.endswith('.cram'):
                info['cram_count'] += 1

            # Collect filenames for pattern matching
            info['filenames'].append(filename_lower)

            # Sum file sizes
            try:
                size = os.path.getsize(os.path.join(root, filename))
                info['total_size_gb'] += size / (1024**3)
            except OSError:
                # Broken symlinks or unreadable files are skipped
                pass

    return info


def calculate_pipeline_scores(scan_info: Dict, configs: Dict) -> Dict[str, Dict]:
    """Calculate confidence scores for each pipeline."""
    scores = {}

    for pipeline_name, config in configs.items():
        score = 0
        matches = []

        # Check detection hints
        hints = config.get('detection_hints', {})

        # Filename hints
        filename_hints = hints.get('filename', [])
        for hint in filename_hints:
            hint_lower = hint.lower()
            for filename in scan_info['filenames']:
                if hint_lower in filename:
                    score += 10
                    matches.append(f"Filename contains '{hint}'")
                    break

        # Directory hints
        directory_hints = hints.get('directory', [])
        for hint in directory_hints:
            hint_lower = hint.lower()
            for dirname in scan_info['directories']:
                if hint_lower in dirname:
                    score += 15
                    matches.append(f"Directory contains '{hint}'")
                    break

        # Check data type compatibility
        input_types = config.get('samplesheet', {}).get('input_types', ['fastq'])

        # Prefer pipelines that support the available file types
        if 'fastq' in input_types and scan_info['fastq_count'] > 0:
            score += 5
        if 'bam' in input_types and scan_info['bam_count'] > 0:
            score += 5
        if 'cram' in input_types and scan_info['cram_count'] > 0:
            score += 5

        # Pipeline-specific boosts
        if pipeline_name == 'sarek':
            # Check for tumor/normal indicators
            tumor_indicators = ['tumor', 'tumour', 'cancer', 'met', 'primary']
            normal_indicators = ['normal', 'germline', 'blood', 'control']

            has_tumor = any(ind in ' '.join(scan_info['filenames']) for ind in tumor_indicators)
            has_normal = any(ind in ' '.join(scan_info['filenames']) for ind in normal_indicators)

            if has_tumor or has_normal:
                score += 20
                if has_tumor:
                    matches.append("Found tumor sample indicators")
                if has_normal:
                    matches.append("Found normal sample indicators")

            # DNA-related hints
            dna_hints = ['wgs', 'wes', 'exome', 'dna', 'variant', 'snp', 'indel']
            for hint in dna_hints:
                if hint in ' '.join(scan_info['filenames'] + scan_info['directories']):
                    score += 10
                    matches.append(f"Found DNA/variant indicator: '{hint}'")
                    break

        elif pipeline_name == 'rnaseq':
            # RNA-related hints
            rna_hints = ['rna', 'rnaseq', 'mrna', 'expression', 'transcript', 'counts']
            for hint in rna_hints:
                if hint in ' '.join(scan_info['filenames'] + scan_info['directories']):
                    score += 15
                    matches.append(f"Found RNA indicator: '{hint}'")
                    break

        elif pipeline_name == 'atacseq':
            # ATAC-related hints
            atac_hints = ['atac', 'atacseq', 'chromatin', 'accessibility', 'peak', 'macs']
            for hint in atac_hints:
                if hint in ' '.join(scan_info['filenames'] + scan_info['directories']):
                    score += 20
                    matches.append(f"Found ATAC-seq indicator: '{hint}'")
                    break

        scores[pipeline_name] = {
            'score': score,
            'matches': matches,
            'description': config.get('description', ''),
            'version': config.get('version', 'unknown'),
        }

    return scores


def detect_pipeline(directory: str) -> Tuple[str, Dict]:
    """
    Detect the most appropriate pipeline for the data.

    Args:
        directory: Path to data directory

    Returns:
        Tuple of (recommended_pipeline, all_scores)
    """
    if not os.path.isdir(directory):
        raise ValueError(f"Not a directory: {directory}")

    configs = load_all_pipeline_configs()
    scan_info = scan_directory(directory)

    # Check if any sequencing files found
    total_files = scan_info['fastq_count'] + scan_info['bam_count'] + scan_info['cram_count']
    if total_files == 0:
        raise ValueError(f"No sequencing files (FASTQ/BAM/CRAM) found in {directory}")

    scores = calculate_pipeline_scores(scan_info, configs)
    if not scores:
        raise ValueError("No pipeline configs found in config/pipelines/")

    # Find highest scoring pipeline
    best_pipeline = max(scores.keys(), key=lambda k: scores[k]['score'])

    return best_pipeline, scores


def print_results(
    directory: str,
    recommended: str,
    scores: Dict,
    scan_info: Dict,
    output_json: bool = False
):
    """Print detection results."""
    if output_json:
        result = {
            'recommended': recommended,
            'scores': scores,
            'scan_info': {
                'fastq_count': scan_info['fastq_count'],
                'bam_count': scan_info['bam_count'],
                'cram_count': scan_info['cram_count'],
                'total_size_gb': round(scan_info['total_size_gb'], 2),
            }
        }
        print(json.dumps(result, indent=2))
        return

    print("\n" + "=" * 50)
    print("  nf-core Pipeline Detection")
    print("=" * 50)
    print(f"\nDirectory: {directory}")
    print(f"Files found: {scan_info['fastq_count']} FASTQ, "
          f"{scan_info['bam_count']} BAM, {scan_info['cram_count']} CRAM")
    print(f"Total size: {scan_info['total_size_gb']:.1f} GB")

    print("\n--- Pipeline Scores ---")
    sorted_pipelines = sorted(scores.keys(), key=lambda k: scores[k]['score'], reverse=True)

    for pipeline in sorted_pipelines:
        info = scores[pipeline]
        indicator = "→" if pipeline == recommended else " "
        print(f"\n{indicator} {pipeline} (score: {info['score']})")
        print(f"  {info['description']}")
        if info['matches']:
            print(f"  Matches: {', '.join(info['matches'][:3])}")

    print(f"\n{'=' * 50}")
    print(f"\n\033[92mRecommended: {recommended}\033[0m")
    print(f"Version: {scores[recommended]['version']}")

    # Print suggested next steps
    print(f"\n--- Next Steps ---")
    print(f"1. Run environment check:")
    print(f"   python scripts/check_environment.py")
    print(f"\n2. Run test profile:")
    config = load_all_pipeline_configs().get(recommended, {})
    test_cmd = config.get('test_profile', {}).get('command', '')
    if test_cmd:
        print(f"   {test_cmd}")
    print(f"\n3. Generate samplesheet:")
    print(f"   python scripts/generate_samplesheet.py {directory} {recommended}")


def main():
    parser = argparse.ArgumentParser(
        description='Detect appropriate nf-core pipeline for data',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    %(prog)s ./data
    %(prog)s ./fastqs --json
        """
    )

    parser.add_argument('directory', help='Directory containing sequencing data')
    parser.add_argument('--json', action='store_true', help='Output as JSON')

    args = parser.parse_args()

    try:
        scan_info = scan_directory(args.directory)
        recommended, scores = detect_pipeline(args.directory)
        print_results(args.directory, recommended, scores, scan_info, args.json)
        sys.exit(0)

    except Exception as e:
        # ValueError (bad input) and unexpected errors are reported the same way
        if args.json:
            print(json.dumps({'error': str(e)}))
        else:
            print(f"Error: {e}")
        sys.exit(1)


if __name__ == '__main__':
    main()
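The detection script reads per-pipeline YAML files from `config/pipelines/`, which are not reproduced here. The sketch below shows, as a Python dict, the keys the script looks up, together with a compact version of the generic hint scoring. All values in `example_config` are hypothetical, `hint_score` is an illustrative helper, and the pipeline-specific boosts for sarek, rnaseq, and atacseq are deliberately left out.

```python
# Hypothetical config, shaped after the keys the script looks up:
# detection_hints, samplesheet.input_types, description, version, test_profile.
example_config = {
    "description": "RNA-seq quantification",
    "version": "unknown",
    "detection_hints": {"filename": ["rnaseq"], "directory": ["rna"]},
    "samplesheet": {"input_types": ["fastq"]},
    "test_profile": {
        "command": "nextflow run nf-core/rnaseq -profile test,docker --outdir test_rnaseq"
    },
}


def hint_score(config, filenames, directories, fastq_count):
    """Score one pipeline using the filename, directory, and file-type rules."""
    hints = config.get("detection_hints", {})
    score = 0
    # +10 per filename hint that appears in at least one filename
    for hint in hints.get("filename", []):
        if any(hint.lower() in name for name in filenames):
            score += 10
    # +15 per directory hint that appears in at least one directory name
    for hint in hints.get("directory", []):
        if any(hint.lower() in name for name in directories):
            score += 15
    # +5 when the pipeline accepts FASTQ and FASTQ files are present
    input_types = config.get("samplesheet", {}).get("input_types", ["fastq"])
    if "fastq" in input_types and fastq_count > 0:
        score += 5
    return score


print(hint_score(example_config, ["ctrl_rnaseq_r1.fastq.gz"], ["rna_run1"], 1))
```

Here one filename hint (10), one directory hint (15), and the FASTQ compatibility bonus (5) combine to a score of 30. Because matching is by substring, short hints can match unrelated names, so the recommendation is best treated as a suggestion to confirm with the user.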

#!/usr/bin/env python3
"""
Enhanced nf-core samplesheet generator.

Features:
- FASTQ, BAM, and CRAM support
- Tumor/normal status inference for sarek
- Robust R1/R2 matching with scoring
- Pre-write validation with clear error messages
- Pipeline config-driven column generation

Usage:
    python generate_samplesheet.py /path/to/data rnaseq -o samplesheet.csv
    python generate_samplesheet.py /path/to/bams sarek --input-type bam
    python generate_samplesheet.py --validate samplesheet.csv rnaseq
"""

import argparse
import os
import sys
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import yaml

# Add this script's directory to sys.path so the utils package can be imported
sys.path.insert(0, str(Path(__file__).parent))

from utils.file_discovery import discover_files, detect_input_type, find_index_file
from utils.sample_inference import (
    extract_sample_info,
    infer_tumor_normal_status,
    match_read_pairs,
    extract_replicate_number
)
from utils.validators import validate_samplesheet, ValidationResult


def load_pipeline_config(pipeline: str) -> Dict:
    """Load pipeline configuration from YAML."""
    config_dir = Path(__file__).parent / "config" / "pipelines"
    config_file = config_dir / f"{pipeline}.yaml"

    if not config_file.exists():
        available = [f.stem for f in config_dir.glob("*.yaml") if not f.stem.startswith("_")]
        raise ValueError(f"Unknown pipeline '{pipeline}'. Available: {', '.join(available)}")

    with open(config_file) as f:
        return yaml.safe_load(f)


def generate_samplesheet(
    input_dir: str,
    pipeline: str,
    output_file: Optional[str] = None,
    input_type: str = "auto",
    single_end: bool = False,
    interactive: bool = True
) -> Tuple[Optional[str], ValidationResult]:
    """
    Generate samplesheet for specified pipeline.

    Args:
        input_dir: Directory containing sequencing files
        pipeline: Pipeline name (rnaseq, sarek, atacseq)
        output_file: Output CSV path (default: samplesheet_{pipeline}.csv)
        input_type: File type (auto, fastq, bam, cram)
        single_end: Suppress pairing warnings for single-end data
        interactive: Prompt for missing info

    Returns:
        Tuple of (output_path, validation_result)
    """
    config = load_pipeline_config(pipeline)
    samplesheet_config = config.get("samplesheet", {})
    supported_types = samplesheet_config.get("input_types", ["fastq"])

    # Determine input type
    if input_type == "auto":
        input_type = detect_input_type(input_dir)
        print(f"Auto-detected input type: {input_type.upper()}")

    if input_type not in supported_types:
        return None, ValidationResult(
            valid=False,
            errors=[f"Pipeline '{pipeline}' does not support {input_type.upper()} input. "
                    f"Supported: {supported_types}"]
        )

    # Discover files
    try:
        files = discover_files(input_dir, input_type)
    except ValueError as e:
        return None, ValidationResult(valid=False, errors=[str(e)])

    if not files:
        return None, ValidationResult(
            valid=False,
            errors=[f"No {input_type.upper()} files found in {input_dir}"],
            suggestions=[
                "Check directory path is correct",
                "Verify file extensions (.fastq.gz, .fq.gz, .bam, .cram)",
                f"Run: ls {input_dir}"
            ]
        )

    print(f"Found {len(files)} {input_type.upper()} files")

    # Process based on input type
    if input_type == "fastq":
        rows = _process_fastq_files(files, config, single_end)
    else:
        rows = _process_alignment_files(files, config, input_type)

    if not rows:
        return None, ValidationResult(
            valid=False,
            errors=["Could not generate any samplesheet rows from files"]
        )

    print(f"Generated {len(rows)} samplesheet rows")

    # Pipeline-specific processing
    if pipeline == "sarek":
        rows = _process_sarek_samples(rows, interactive)
    elif pipeline == "atacseq":
        rows = _process_atacseq_samples(rows)

    # Validate before writing
    validation = validate_samplesheet(rows, pipeline, config)

    if not validation.valid:
        print("\nValidation errors:")
        for error in validation.errors:
            print(f"  - {error}")

        if interactive:
            response = input("\nProceed anyway? [y/N]: ").strip().lower()
            if response != 'y':
                return None, validation
    elif validation.warnings:
        print("\nWarnings:")
        for warning in validation.warnings:
            print(f"  - {warning}")

    # Determine output path
    output_path = output_file or f"samplesheet_{pipeline}.csv"

    # Write samplesheet
    _write_samplesheet(rows, config, output_path)

    print(f"\nGenerated: {output_path}")
    print(f"  Pipeline: {pipeline} v{config.get('version', 'unknown')}")
    print(f"  Samples: {len(set(r.get('sample', r.get('patient', '')) for r in rows))}")
    print(f"  Rows: {len(rows)}")

    # Preview
    _print_preview(rows, config)

    return output_path, validation


def _process_fastq_files(files, config: Dict, single_end: bool) -> List[Dict]:
    """Process FASTQ files into samplesheet rows."""
    pairs = match_read_pairs(files)

    if not pairs:
        return []

    # Check for unpaired files
    unpaired = [k for k, v in pairs.items() if v.get('r1') and not v.get('r2')]
    if unpaired and not single_end:
        print(f"\nNote: {len(unpaired)} samples appear to be single-end (no R2)")

    rows = []
    columns = config.get("samplesheet", {}).get("columns", [])

    for sample_key, pair_info in sorted(pairs.items()):
        if not pair_info.get('r1'):
            continue  # Skip entries with only R2

        info = pair_info.get('info', {})

        row = {
            'sample': info.get('sample', sample_key),
            'fastq_1': str(Path(pair_info['r1']).absolute()),
            'fastq_2': str(Path(pair_info['r2']).absolute()) if pair_info.get('r2') else '',
        }

        # Add additional info from filename
        if 'patient' in [c['name'] for c in columns]:
            row['patient'] = info.get('patient', info.get('sample', sample_key))

        if 'lane' in [c['name'] for c in columns]:
            row['lane'] = info.get('lane', 'L001')

        # Apply defaults from config
        for col in columns:
            if col['name'] not in row and 'default' in col:
                row[col['name']] = col['default']

        rows.append(row)

    return rows


def _process_alignment_files(files, config: Dict, input_type: str) -> List[Dict]:
    """Process BAM/CRAM files into samplesheet rows."""
    rows = []
    columns = config.get("samplesheet", {}).get("columns", [])

    for file_info in files:
        # Find index file
        index_path = find_index_file(file_info.path)

        info = extract_sample_info(file_info.path)

        row = {
            'sample': info.get('sample', file_info.stem),
            'bam': str(Path(file_info.path).absolute()),
            'bai': str(Path(index_path).absolute()) if index_path else '',
        }

        # Add patient for sarek
        if 'patient' in [c['name'] for c in columns]:
            row['patient'] = info.get('patient', info.get('sample', file_info.stem))

        # Apply defaults
        for col in columns:
            if col['name'] not in row and 'default' in col:
                row[col['name']] = col['default']

        # Warn if no index found
        if not index_path:
            print(f"  Warning: No index found for {file_info.name}")

        rows.append(row)

    return rows


def _process_sarek_samples(rows: List[Dict], interactive: bool) -> List[Dict]:
    """Process sarek samples: infer and confirm tumor/normal status."""
    # Auto-infer status from sample names
    for row in rows:
        sample_name = row.get('sample', '')
        inferred = infer_tumor_normal_status(sample_name)
        if inferred is not None:
            row['status'] = inferred

    # Report inference results
    inferred_tumor = [r for r in rows if r.get('status') == 1]
    inferred_normal = [r for r in rows if r.get('status') == 0]
    unknown = [r for r in rows if 'status' not in r]

    if inferred_tumor or inferred_normal:
        print("\nTumor/normal inference:")
        print(f"  Tumor samples: {len(inferred_tumor)}")
        print(f"  Normal samples: {len(inferred_normal)}")

    # Handle unknown samples
    if unknown and interactive:
        print(f"\n{len(unknown)} sample(s) with unknown status:")
        for r in unknown:
            print(f"  - {r.get('sample')}")

        print("\nSpecify status for each (0=normal, 1=tumor, Enter=default to normal):")
        for r in unknown:
            response = input(f"  {r.get('sample')} [0/1/Enter]: ").strip()
            if response in ['0', '1']:
                r['status'] = int(response)
            else:
                r['status'] = 0  # Default to normal
                print("    Defaulting to normal (0)")
    elif unknown:
        # Non-interactive: default to normal
        for r in unknown:
            r['status'] = 0

    return rows


def _process_atacseq_samples(rows: List[Dict]) -> List[Dict]:
    """Process ATAC-seq samples: ensure replicate numbers."""
    # Assign replicate numbers if not present, counting per sample name
    sample_rep = {}
    for row in rows:
        sample = row.get('sample', '')

        if 'replicate' not in row or not row['replicate']:
            # Try to extract from filename
            extracted = extract_replicate_number(row.get('fastq_1', ''))
            if extracted:
                row['replicate'] = extracted
            else:
                # Auto-assign sequential
                if sample not in sample_rep:
                    sample_rep[sample] = 0
                sample_rep[sample] += 1
                row['replicate'] = sample_rep[sample]

    return rows


def _write_samplesheet(rows: List[Dict], config: Dict, output_path: str):
    """Write samplesheet to CSV file."""
    columns = config.get("samplesheet", {}).get("columns", [])
    column_names = [c['name'] for c in columns]

    # Filter to columns that have data (0 counts as data, e.g. sarek status=0)
    active_columns = [
        c for c in column_names
        if any(row.get(c) not in (None, '') for row in rows)
    ]

    # Ensure fastq_1/fastq_2 or bam/bai are included
    for required in ['fastq_1', 'bam']:
        if required in column_names and required not in active_columns:
            if any(required in row for row in rows):
                active_columns.append(required)

    # Maintain original column order
    active_columns = [c for c in column_names if c in active_columns]

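    import csv  # Local import: the csv writer quotes values containing commas or quotes

    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(active_columns)
        for row in rows:
            writer.writerow([str(row.get(col, '')) for col in active_columns])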


def _print_preview(rows: List[Dict], config: Dict):
    """Print preview of generated samplesheet."""
    columns = config.get("samplesheet", {}).get("columns", [])
    column_names = [c['name'] for c in columns]
    active_columns = [c for c in column_names if any(c in row for row in rows)]

    print(f"\nPreview (first 3 rows):")
    print(','.join(active_columns))
    for row in rows[:3]:
        values = [str(row.get(col, ''))[:40] for col in active_columns]  # Truncate long paths
        print(','.join(values))
    if len(rows) > 3:
        print(f"... ({len(rows) - 3} more rows)")


def validate_existing_samplesheet(csv_path: str, pipeline: str) -> ValidationResult:
    """Validate an existing samplesheet file."""
    import csv

    if not os.path.exists(csv_path):
        return ValidationResult(valid=False, errors=[f"File not found: {csv_path}"])

    try:
        with open(csv_path, 'r', newline='') as f:
            reader = csv.DictReader(f)
            rows = list(reader)
    except Exception as e:
        return ValidationResult(valid=False, errors=[f"Failed to read CSV: {e}"])

    if not rows:
        return ValidationResult(valid=False, errors=["Samplesheet is empty"])

    config = load_pipeline_config(pipeline)
    return validate_samplesheet(rows, pipeline, config)


def main():
    parser = argparse.ArgumentParser(
        description='Generate nf-core samplesheet from data directory',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
    # Generate samplesheet for RNA-seq
    %(prog)s ./fastqs rnaseq -o samples.csv

    # Generate samplesheet for sarek from BAM files
    %(prog)s ./bams sarek --input-type bam

    # Validate existing samplesheet
    %(prog)s --validate samplesheet.csv rnaseq

Supported pipelines: rnaseq, sarek, atacseq
        """
    )

    parser.add_argument('input', help='Directory with data files, or CSV path for --validate')
    parser.add_argument('pipeline', help='Pipeline name (rnaseq, sarek, atacseq)')
    parser.add_argument('-o', '--output', help='Output CSV filename')
    parser.add_argument('--input-type', choices=['auto', 'fastq', 'bam', 'cram'],
                        default='auto', help='Input file type (default: auto-detect)')
    parser.add_argument('--single-end', action='store_true',
                        help='Treat as single-end data (suppress pairing warnings)')
    parser.add_argument('--validate', action='store_true',
                        help='Validate existing samplesheet instead of generating')
    parser.add_argument('--no-interactive', action='store_true',
                        help='Non-interactive mode (use defaults)')

    args = parser.parse_args()

    try:
        if args.validate:
            # Validate existing samplesheet
            result = validate_existing_samplesheet(args.input, args.pipeline)
            if result.valid:
                print(f"✓ Samplesheet is valid for {args.pipeline}")
                if result.warnings:
                    print("\nWarnings:")
                    for w in result.warnings:
                        print(f"  - {w}")
                sys.exit(0)
            else:
                print(f"✗ Samplesheet validation failed")
                print(result.summary())
                sys.exit(1)
        else:
            # Generate new samplesheet
            if not os.path.isdir(args.input):
                print(f"Error: Not a directory: {args.input}")
                sys.exit(1)

            output_path, result = generate_samplesheet(
                args.input,
                args.pipeline,
                args.output,
                args.input_type,
                args.single_end,
                interactive=not args.no_interactive
            )

            if output_path is None:
                print("\nFailed to generate samplesheet.")
                if result.suggestions:
                    print("\nSuggestions:")
                    for s in result.suggestions:
                        print(f"  - {s}")
                sys.exit(1)

            sys.exit(0)

    except ValueError as e:
        print(f"Error: {e}")
        sys.exit(1)
    except KeyboardInterrupt:
        print("\nAborted.")
        sys.exit(1)


if __name__ == '__main__':
    main()
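
The samplesheet writer above keeps only columns that carry data and then emits comma-separated rows. The sketch below shows the same step with the standard-library `csv` module, so that a path containing a comma is quoted and a column holding only `0` values (for example an all-normal `status`) is kept. `write_rows` and the sample rows are hypothetical names used for illustration; they are not part of the script.

```python
import csv
import io
from typing import Dict, List


def write_rows(rows: List[Dict], column_names: List[str], handle) -> List[str]:
    """Write rows as CSV and return the columns that were written.

    Hypothetical helper: a column is kept when at least one row has a value
    other than None or the empty string, so the integer 0 counts as data.
    """
    active = [
        c for c in column_names
        if any(row.get(c) not in (None, '') for row in rows)
    ]
    writer = csv.DictWriter(handle, fieldnames=active, extrasaction='ignore',
                            lineterminator='\n')
    writer.writeheader()
    for row in rows:
        writer.writerow({c: row.get(c, '') for c in active})
    return active


# Illustrative rows: single-end data, one path containing a comma
rows = [
    {'sample': 'S1', 'fastq_1': '/data/run,1/S1_R1.fastq.gz', 'fastq_2': '', 'status': 0},
    {'sample': 'S2', 'fastq_1': '/data/S2_R1.fastq.gz', 'fastq_2': '', 'status': 0},
]
buffer = io.StringIO()
active = write_rows(rows, ['sample', 'fastq_1', 'fastq_2', 'status'], buffer)
print(buffer.getvalue())
```

Here the empty `fastq_2` column is dropped, `status` is retained, and the comma-containing path is written inside quotes so that `csv.DictReader` reads it back unchanged.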

#!/usr/bin/env python3
"""
Genome reference management for nf-core pipelines.

Manages downloading, caching, and accessing genome references from iGenomes.
Supports auto-download when references aren't available locally.

Usage:
    python manage_genomes.py list
    python manage_genomes.py check GRCh38
    python manage_genomes.py download GRCh38
    python manage_genomes.py params GRCh38
"""

import argparse
import json
import os
import subprocess
import sys
from pathlib import Path
from typing import Dict, List, Optional


# iGenomes reference configuration
IGENOMES = {
    # Human
    'GRCh38': {
        'display_name': 'Human GRCh38/hg38',
        'species': 'Homo sapiens',
        'aliases': ['hg38', 'GRCh38.p14'],
        's3_base': 's3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh38',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    'GRCh37': {
        'display_name': 'Human GRCh37/hg19',
        'species': 'Homo sapiens',
        'aliases': ['hg19', 'GRCh37.p13'],
        's3_base': 's3://ngi-igenomes/igenomes/Homo_sapiens/NCBI/GRCh37',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Mouse
    'GRCm39': {
        'display_name': 'Mouse GRCm39/mm39',
        'species': 'Mus musculus',
        'aliases': ['mm39', 'GRCm39'],
        's3_base': 's3://ngi-igenomes/igenomes/Mus_musculus/Ensembl/GRCm39',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    'GRCm38': {
        'display_name': 'Mouse GRCm38/mm10',
        'species': 'Mus musculus',
        'aliases': ['mm10', 'GRCm38'],
        's3_base': 's3://ngi-igenomes/igenomes/Mus_musculus/NCBI/GRCm38',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Yeast
    'R64-1-1': {
        'display_name': 'Yeast R64-1-1/sacCer3',
        'species': 'Saccharomyces cerevisiae',
        'aliases': ['sacCer3', 'S288C', 'yeast'],
        's3_base': 's3://ngi-igenomes/igenomes/Saccharomyces_cerevisiae/Ensembl/R64-1-1',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Fruit fly
    'BDGP6': {
        'display_name': 'Drosophila BDGP6/dm6',
        'species': 'Drosophila melanogaster',
        'aliases': ['dm6', 'BDGP6', 'fly'],
        's3_base': 's3://ngi-igenomes/igenomes/Drosophila_melanogaster/Ensembl/BDGP6',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
        }
    },
    # C. elegans
    'WBcel235': {
        'display_name': 'C. elegans WBcel235/ce11',
        'species': 'Caenorhabditis elegans',
        'aliases': ['ce11', 'worm'],
        's3_base': 's3://ngi-igenomes/igenomes/Caenorhabditis_elegans/Ensembl/WBcel235',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Zebrafish
    'GRCz11': {
        'display_name': 'Zebrafish GRCz11/danRer11',
        'species': 'Danio rerio',
        'aliases': ['danRer11', 'zebrafish'],
        's3_base': 's3://ngi-igenomes/igenomes/Danio_rerio/Ensembl/GRCz11',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    'GRCz10': {
        'display_name': 'Zebrafish GRCz10/danRer10',
        'species': 'Danio rerio',
        'aliases': ['danRer10'],
        's3_base': 's3://ngi-igenomes/igenomes/Danio_rerio/Ensembl/GRCz10',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
        }
    },
    # Rat
    'Rnor_6.0': {
        'display_name': 'Rat Rnor_6.0/rn6',
        'species': 'Rattus norvegicus',
        'aliases': ['rn6', 'Rnor6', 'rat'],
        's3_base': 's3://ngi-igenomes/igenomes/Rattus_norvegicus/Ensembl/Rnor_6.0',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Arabidopsis
    'TAIR10': {
        'display_name': 'Arabidopsis TAIR10',
        'species': 'Arabidopsis thaliana',
        'aliases': ['arabidopsis'],
        's3_base': 's3://ngi-igenomes/igenomes/Arabidopsis_thaliana/Ensembl/TAIR10',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
            'bwa_index': 'Sequence/BWAIndex/',
            'star_index': 'Sequence/STARIndex/',
        }
    },
    # Chicken
    'GRCg6a': {
        'display_name': 'Chicken GRCg6a/galGal6',
        'species': 'Gallus gallus',
        'aliases': ['galGal6', 'chicken'],
        's3_base': 's3://ngi-igenomes/igenomes/Gallus_gallus/Ensembl/GRCg6a',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
        }
    },
    # Dog
    'CanFam3.1': {
        'display_name': 'Dog CanFam3.1/canFam3',
        'species': 'Canis lupus familiaris',
        'aliases': ['canFam3', 'dog'],
        's3_base': 's3://ngi-igenomes/igenomes/Canis_familiaris/Ensembl/CanFam3.1',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
        }
    },
    # Pig
    'Sscrofa11.1': {
        'display_name': 'Pig Sscrofa11.1/susScr11',
        'species': 'Sus scrofa',
        'aliases': ['susScr11', 'pig'],
        's3_base': 's3://ngi-igenomes/igenomes/Sus_scrofa/Ensembl/Sscrofa11.1',
        'files': {
            'fasta': 'Sequence/WholeGenomeFasta/genome.fa',
            'gtf': 'Annotation/Genes/genes.gtf',
        }
    },
}


def get_cache_dir() -> Path:
    """Get genome cache directory."""
    cache_dir = os.environ.get(
        'NF_CORE_GENOME_CACHE',
        os.path.expanduser('~/.nf-core/genomes')
    )
    return Path(cache_dir)


def resolve_genome_id(genome: str) -> Optional[str]:
    """Resolve genome ID from name or alias."""
    # Direct match
    if genome in IGENOMES:
        return genome

    # Case-insensitive match against genome IDs and aliases
    genome_lower = genome.lower()
    for gid, info in IGENOMES.items():
        if gid.lower() == genome_lower:
            return gid
        if genome_lower in [a.lower() for a in info.get('aliases', [])]:
            return gid

    return None


def is_genome_installed(genome_id: str) -> bool:
    """Check if genome is installed locally."""
    cache_dir = get_cache_dir()
    genome_dir = cache_dir / genome_id

    # Check for fasta as minimum requirement
    fasta_path = genome_dir / 'genome.fa'
    return fasta_path.exists()


def get_genome_path(genome_id: str) -> Optional[Path]:
    """Get local path to genome if installed."""
    if not is_genome_installed(genome_id):
        return None
    return get_cache_dir() / genome_id


def list_genomes(installed_only: bool = False) -> List[Dict]:
    """List available genomes."""
    result = []

    for genome_id, info in IGENOMES.items():
        installed = is_genome_installed(genome_id)

        if installed_only and not installed:
            continue

        genome_path = get_genome_path(genome_id) if installed else None

        result.append({
            'id': genome_id,
            'display_name': info['display_name'],
            'species': info['species'],
            'aliases': info.get('aliases', []),
            'installed': installed,
            'path': str(genome_path) if genome_path else None,
        })

    return result


def download_genome(
    genome_id: str,
    components: Optional[List[str]] = None,
    force: bool = False
) -> bool:
    """
    Download genome reference files from iGenomes.

    Args:
        genome_id: Genome identifier (e.g., GRCh38)
        components: Specific components to download (fasta, gtf, etc.)
        force: Overwrite existing files

    Returns:
        True if successful
    """
    # Resolve genome ID
    resolved = resolve_genome_id(genome_id)
    if not resolved:
        print(f"Unknown genome: {genome_id}")
        print(f"Available: {', '.join(IGENOMES.keys())}")
        return False

    genome_id = resolved
    info = IGENOMES[genome_id]

    # Check for AWS CLI (shutil.which is portable; the `which` binary is not)
    import shutil
    aws_available = shutil.which('aws') is not None

    if not aws_available:
        print("AWS CLI not found. Required for iGenomes download.")
        print("Install with: pip install awscli")
        print("\nAlternative: Use --genome flag with nf-core pipelines")
        print("which will auto-download references (slower, per-run).")
        return False

    # Create cache directory
    cache_dir = get_cache_dir()
    genome_dir = cache_dir / genome_id
    genome_dir.mkdir(parents=True, exist_ok=True)

    # Determine components to download
    if components is None:
        components = ['fasta', 'gtf']  # Minimum required

    print(f"Downloading {info['display_name']} to {genome_dir}")
    print(f"Components: {', '.join(components)}")

    success = True
    for component in components:
        if component not in info.get('files', {}):
            print(f"  Skipping {component}: not available for {genome_id}")
            continue

        remote_path = info['files'][component]
        s3_path = f"{info['s3_base']}/{remote_path}"

        # Determine local path
        if remote_path.endswith('/'):
            # Directory (e.g., index)
            local_path = genome_dir / component
        else:
            # File
            filename = Path(remote_path).name
            local_path = genome_dir / filename

        if local_path.exists() and not force:
            print(f"  {component}: Already exists (use --force to overwrite)")
            continue

        print(f"  Downloading {component}...")

        # Build AWS command
        cmd = ['aws', 's3', 'cp', '--no-sign-request']

        if remote_path.endswith('/'):
            cmd.extend(['--recursive', s3_path, str(local_path)])
        else:
            cmd.extend([s3_path, str(local_path)])

        result = subprocess.run(cmd, capture_output=True, text=True)

        if result.returncode != 0:
            print(f"  ERROR downloading {component}:")
            print(f"    {result.stderr[:200]}")
            success = False
        else:
            print(f"  {component}: Downloaded successfully")

    if success:
        print(f"\nGenome {genome_id} ready at: {genome_dir}")
    else:
        print(f"\nSome components failed to download.")

    return success


def get_nextflow_params(genome_id: str) -> Dict[str, str]:
    """
    Get Nextflow parameters for a genome.

    Returns dict with --fasta, --gtf if local,
    or just --genome if using iGenomes key.
    """
    resolved = resolve_genome_id(genome_id)
    if not resolved:
        return {'error': f'Unknown genome: {genome_id}'}

    genome_id = resolved

    # Check if installed locally
    genome_path = get_genome_path(genome_id)

    if genome_path:
        params = {}

        # Check for local files
        fasta = genome_path / 'genome.fa'
        if fasta.exists():
            params['fasta'] = str(fasta)

        gtf = genome_path / 'genes.gtf'
        if gtf.exists():
            params['gtf'] = str(gtf)

        if params:
            return params

    # Fall back to iGenomes key
    return {'genome': genome_id}


def print_genome_list(genomes: List[Dict], output_json: bool = False):
    """Print genome list."""
    if output_json:
        print(json.dumps(genomes, indent=2))
        return

    print("\n" + "=" * 50)
    print("  Available Genomes")
    print("=" * 50 + "\n")

    for g in genomes:
        status = "\033[92m[installed]\033[0m" if g['installed'] else ""
        print(f"  {g['id']}: {g['display_name']} {status}")
        print(f"      Species: {g['species']}")
        print(f"      Aliases: {', '.join(g['aliases'])}")
        if g['path']:
            print(f"      Path: {g['path']}")
        print()


def main():
    parser = argparse.ArgumentParser(
        description='Manage genome references for nf-core pipelines',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Commands:
    list              List available genomes
    check <genome>    Check if genome is installed
    download <genome> Download genome from iGenomes
    params <genome>   Get Nextflow parameters for genome

Examples:
    %(prog)s list
    %(prog)s list --installed
    %(prog)s check GRCh38
    %(prog)s download GRCh38
    %(prog)s download GRCh38 --components fasta gtf star_index
    %(prog)s params GRCh38
        """
    )

    subparsers = parser.add_subparsers(dest='command', help='Commands')

    # List command
    list_parser = subparsers.add_parser('list', help='List available genomes')
    list_parser.add_argument('--installed', action='store_true',
                             help='Show only installed genomes')
    list_parser.add_argument('--json', action='store_true',
                             help='Output as JSON')

    # Check command
    check_parser = subparsers.add_parser('check', help='Check if genome is installed')
    check_parser.add_argument('genome', help='Genome ID (e.g., GRCh38)')
    check_parser.add_argument('--json', action='store_true',
                              help='Output as JSON')

    # Download command
    dl_parser = subparsers.add_parser('download', help='Download genome from iGenomes')
    dl_parser.add_argument('genome', help='Genome ID (e.g., GRCh38)')
    dl_parser.add_argument('--components', nargs='+',
                           help='Specific components (fasta, gtf, bwa_index, star_index)')
    dl_parser.add_argument('--force', action='store_true',
                           help='Overwrite existing files')

    # Params command
    params_parser = subparsers.add_parser('params', help='Get Nextflow params for genome')
    params_parser.add_argument('genome', help='Genome ID')
    params_parser.add_argument('--json', action='store_true',
                               help='Output as JSON')

    args = parser.parse_args()

    if args.command == 'list':
        genomes = list_genomes(installed_only=args.installed)
        print_genome_list(genomes, args.json)

    elif args.command == 'check':
        resolved = resolve_genome_id(args.genome)
        if not resolved:
            print(f"Unknown genome: {args.genome}")
            sys.exit(1)

        installed = is_genome_installed(resolved)
        path = get_genome_path(resolved) if installed else None

        if args.json:
            print(json.dumps({
                'genome': resolved,
                'installed': installed,
                'path': str(path) if path else None
            }))
        else:
            if installed:
                print(f"✓ Genome {resolved} is installed at: {path}")
            else:
                print(f"✗ Genome {resolved} is not installed locally")
                print(f"  Download with: python {sys.argv[0]} download {resolved}")

        sys.exit(0 if installed else 1)

    elif args.command == 'download':
        success = download_genome(args.genome, args.components, args.force)
        sys.exit(0 if success else 1)

    elif args.command == 'params':
        params = get_nextflow_params(args.genome)

        if args.json:
            print(json.dumps(params))
        elif 'error' in params:
            print(f"Error: {params['error']}")
        else:
            for key, value in params.items():
                print(f"--{key} {value}")

        # Exit non-zero for unknown genomes in both output modes
        if 'error' in params:
            sys.exit(1)

    else:
        parser.print_help()
        sys.exit(1)


if __name__ == '__main__':
    main()
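
The `params` command above prints each parameter as `--key value`. A minimal sketch of turning the same dictionary into an argument list for a Nextflow command line follows. `params_to_args` is a hypothetical helper and the paths are made-up examples; the only assumption taken from the script is the shape of the dictionary returned by `get_nextflow_params` (`fasta`/`gtf` for a local genome, `genome` for an iGenomes key, or `error`).

```python
from typing import Dict, List


def params_to_args(params: Dict[str, str]) -> List[str]:
    """Convert a params dict into a flat list of CLI arguments.

    Hypothetical helper: raises ValueError when the dict carries an
    'error' entry instead of silently producing an '--error' flag.
    """
    if 'error' in params:
        raise ValueError(params['error'])
    args: List[str] = []
    for key, value in params.items():
        args.extend([f'--{key}', str(value)])
    return args


# Locally cached genome (illustrative paths)
local = params_to_args({
    'fasta': '/refs/GRCh38/genome.fa',
    'gtf': '/refs/GRCh38/genes.gtf',
})
# Genome not cached: fall back to the iGenomes key
remote = params_to_args({'genome': 'GRCh38'})
print(local)
print(remote)
```

Returning a list rather than a joined string keeps each value as a separate argument, which avoids quoting problems when a path contains spaces and the list is passed to `subprocess.run`.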

#!/usr/bin/env python3
"""
GEO/SRA Data Fetcher
====================
Download raw sequencing data from NCBI GEO/SRA and prepare for nf-core pipelines.

Usage:
    python sra_geo_fetch.py info <GEO_ID>              # Get study information
    python sra_geo_fetch.py list <GEO_ID>              # List all samples/runs
    python sra_geo_fetch.py download <GEO_ID> -o DIR   # Download FASTQ files
    python sra_geo_fetch.py samplesheet <GEO_ID> ...   # Generate samplesheet

Examples:
    python sra_geo_fetch.py info GSE110004
    python sra_geo_fetch.py list GSE110004 --filter "RNA-Seq:PAIRED"
    python sra_geo_fetch.py download GSE110004 -o ./fastq --parallel 4
    python sra_geo_fetch.py samplesheet GSE110004 --fastq-dir ./fastq -o samplesheet.csv
"""

import argparse
import json
import logging
import os
import re
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Add utils to path
sys.path.insert(0, str(Path(__file__).parent))
from utils.ncbi_utils import (
    check_network_access,
    fetch_geo_metadata,
    fetch_sra_study_accession,
    fetch_sra_run_info,
    fetch_sra_run_info_detailed,
    fetch_ena_fastq_urls,
    download_file,
    format_file_size,
    estimate_download_size,
    group_samples_by_type,
    format_sample_groups_table,
)

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s'
)
logger = logging.getLogger(__name__)

# Load genome mapping
SCRIPT_DIR = Path(__file__).parent
GENOMES_FILE = SCRIPT_DIR / "config" / "genomes.yaml"


@dataclass
class StudyInfo:
    """Information about a GEO study."""
    geo_id: str
    title: str
    organism: str
    n_samples: int
    summary: str
    sra_study: Optional[str]
    suggested_genome: Optional[str]
    suggested_pipeline: Optional[str]


def load_genome_mapping() -> Dict:
    """Load organism to genome mapping from config."""
    if not GENOMES_FILE.exists():
        return {}

    try:
        import yaml
        with open(GENOMES_FILE) as f:
            config = yaml.safe_load(f)
        # safe_load returns None for an empty file
        return (config or {}).get('organisms') or {}
    except ImportError:
        # Fallback: parse YAML manually for simple cases
        mapping = {}
        try:
            with open(GENOMES_FILE) as f:
                content = f.read()
            # Simple regex parsing for organism blocks
            pattern = r'"([^"]+)":\s*\n\s*genome:\s*"([^"]+)"'
            for match in re.finditer(pattern, content):
                mapping[match.group(1)] = {'genome': match.group(2)}
        except Exception:
            pass
        return mapping


def suggest_genome(organism: str) -> Optional[str]:
    """Suggest a genome based on organism name."""
    genome_map = load_genome_mapping()

    # Direct match
    if organism in genome_map:
        return genome_map[organism].get('genome')

    # Case-insensitive search
    organism_lower = organism.lower()
    for org_name, info in genome_map.items():
        if org_name.lower() == organism_lower:
            return info.get('genome')
        # Check aliases
        aliases = info.get('aliases', [])
        if any(alias.lower() == organism_lower for alias in aliases):
            return info.get('genome')

    # Common fallbacks
    fallbacks = {
        'homo sapiens': 'GRCh38',
        'human': 'GRCh38',
        'mus musculus': 'GRCm39',
        'mouse': 'GRCm39',
        'saccharomyces cerevisiae': 'R64-1-1',
        'yeast': 'R64-1-1',
        'drosophila melanogaster': 'BDGP6',
        'caenorhabditis elegans': 'WBcel235',
        'danio rerio': 'GRCz11',
        'arabidopsis thaliana': 'TAIR10',
        'rattus norvegicus': 'Rnor_6.0',
    }

    return fallbacks.get(organism_lower)


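def suggest_pipeline(library_strategy: str, library_source: str = '') -> str:
    """Suggest nf-core pipeline based on library strategy.

    Unrecognized strategies fall back to 'rnaseq'. The library_source
    argument is accepted but currently unused.
    """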
    strategy = library_strategy.upper()

    pipeline_map = {
        'RNA-SEQ': 'rnaseq',
        'ATAC-SEQ': 'atacseq',
        'CHIP-SEQ': 'chipseq',
        'WGS': 'sarek',
        'WXS': 'sarek',
        'AMPLICON': 'ampliseq',
        'BISULFITE-SEQ': 'methylseq',
        'HI-C': 'hic',
    }

    return pipeline_map.get(strategy, 'rnaseq')


def cmd_info(args):
    """Display study information."""
    geo_id = args.geo_id.upper()

    print(f"\nFetching information for {geo_id}...")

    # Check network
    network_ok, network_msg = check_network_access()
    if not network_ok:
        print(f"\n⚠️  Network issues detected:\n{network_msg}")

    # Get GEO metadata
    metadata = fetch_geo_metadata(geo_id)
    if not metadata:
        print(f"\n❌ Could not fetch metadata for {geo_id}")
        return 1

    # Get SRA study accession
    sra_study = fetch_sra_study_accession(geo_id)

    # Get detailed run info
    print("Fetching SRA run information...")
    runs = fetch_sra_run_info_detailed(geo_id)
    if not runs:
        # Fallback to basic method; normalize to a list so len(runs) is safe below
        runs = fetch_sra_run_info(geo_id) or []

    # Group samples by type
    groups = group_samples_by_type(runs) if runs else {}

    # Suggest genome and pipeline
    organism = metadata.get('organism', 'Unknown')
    genome = suggest_genome(organism)

    # Determine primary data type
    primary_strategy = 'RNA-SEQ'
    if groups:
        primary_group = max(groups.items(), key=lambda x: x[1]['count'])
        primary_strategy = primary_group[1]['strategy']
    pipeline = suggest_pipeline(primary_strategy)

    # Estimate download size
    est_size = estimate_download_size(runs)

    # Display info
    print("\n" + "━" * 70)
    print(f"{geo_id}: {metadata.get('title', 'N/A')}")
    print("━" * 70)
    print(f"Organism:     {organism}")
    print(f"Samples:      {metadata.get('n_samples', 'N/A')}")
    print(f"SRA Study:    {sra_study or 'Not found'}")
    print(f"Runs:         {len(runs)}")
    print(f"Est. Size:    ~{format_file_size(est_size)}")
    print(f"Genome:       {genome or 'Unknown (manual selection required)'}")
    print(f"Pipeline:     nf-core/{pipeline} (suggested)")

    # Show sample groups table
    if groups:
        print(format_sample_groups_table(groups))

    if metadata.get('summary'):
        summary = metadata['summary']
        if len(summary) > 300:
            summary = summary[:297] + "..."
        print(f"\nSummary:\n  {summary}")

    print("━" * 70)

    # Show download hints
    if len(groups) > 1:
        print("\n💡 To download a specific subset, use:")
        for key in sorted(groups.keys()):
            print(f"   --subset \"{key}\"")

    # Save study info JSON
    if args.output_json:
        info = {
            'geo_id': geo_id,
            'title': metadata.get('title'),
            'organism': organism,
            'n_samples': metadata.get('n_samples'),
            'sra_study': sra_study,
            'n_runs': len(runs),
            'groups': {k: {**v, 'runs': None, 'gsm_ids': list(v.get('gsm_ids', []))} for k, v in groups.items()},
            'suggested_genome': genome,
            'suggested_pipeline': pipeline,
            'summary': metadata.get('summary'),
        }
        output_path = Path(args.output_json)
        with open(output_path, 'w') as f:
            json.dump(info, f, indent=2)
        print(f"\n📄 Study info saved to: {output_path}")

    return 0


def cmd_groups(args):
    """Display sample groups in a study for interactive selection."""
    geo_id = args.geo_id.upper()

    print(f"\nFetching sample groups for {geo_id}...")

    # Get detailed run info
    runs = fetch_sra_run_info_detailed(geo_id)
    if not runs:
        runs = fetch_sra_run_info(geo_id)

    if not runs:
        print(f"\n❌ No runs found for {geo_id}")
        return 1

    # Group samples
    groups = group_samples_by_type(runs)

    print(format_sample_groups_table(groups))

    # Output for interactive selection
    print("\n📋 Available groups for --subset option:")
    for i, (key, info) in enumerate(sorted(groups.items(), key=lambda x: -x[1]['count']), 1):
        size_str = format_file_size(info['size_estimate'])
        print(f"  {i}. \"{key}\" - {info['count']} samples (~{size_str})")

    # Save to JSON if requested
    if args.output:
        output_path = Path(args.output)
        output_data = {
            'geo_id': geo_id,
            'groups': {}
        }
        for key, info in groups.items():
            output_data['groups'][key] = {
                'count': info['count'],
                'gsm_range': info['gsm_range'],
                'gsm_ids': info.get('gsm_ids', []),
                'size_estimate': info['size_estimate'],
                'strategy': info['strategy'],
                'layout': info['layout'],
                'srr_ids': [r['srr'] for r in info['runs']],
            }
        with open(output_path, 'w') as f:
            json.dump(output_data, f, indent=2)
        print(f"\n📄 Groups saved to: {output_path}")

    return 0


def cmd_list(args):
    """List all samples and runs in a study."""
    geo_id = args.geo_id.upper()

    print(f"\nFetching run list for {geo_id}...")

    runs = fetch_sra_run_info(geo_id)
    if not runs:
        print(f"\n❌ No runs found for {geo_id}")
        return 1

    # Apply filter if specified
    if args.filter:
        filter_parts = args.filter.split(':')
        strategy_filter = filter_parts[0].upper() if filter_parts else None
        layout_filter = filter_parts[1].upper() if len(filter_parts) > 1 else None

        filtered = []
        for run in runs:
            if strategy_filter and run.get('library_strategy', '').upper() != strategy_filter:
                continue
            if layout_filter and run.get('layout', '').upper() != layout_filter:
                continue
            filtered.append(run)
        runs = filtered

    print(f"\n{'SRR':<15} {'GSM':<12} {'Layout':<8} {'Strategy':<12} {'Size':>10}")
    print("-" * 60)

    for run in runs:
        # Rough size estimate: about one byte per four bases
        size = format_file_size(run.get('bases', 0) // 4)
        print(f"{run['srr']:<15} {run.get('gsm', 'N/A'):<12} {run.get('layout', 'N/A'):<8} "
              f"{run.get('library_strategy', 'N/A'):<12} {size:>10}")

    print(f"\nTotal: {len(runs)} runs")

    # Output as TSV if requested
    if args.output:
        output_path = Path(args.output)
        with open(output_path, 'w') as f:
            f.write("run_accession\tgsm\tlayout\tlibrary_strategy\tbases\n")
            for run in runs:
                f.write(f"{run['srr']}\t{run.get('gsm', '')}\t{run.get('layout', '')}\t"
                        f"{run.get('library_strategy', '')}\t{run.get('bases', 0)}\n")
        print(f"\n📄 Run list saved to: {output_path}")

    return 0


def download_fastq_file(url: str, output_path: Path, timeout: int = 600) -> Tuple[str, bool]:
    """Download a single FASTQ file."""
    filename = output_path.name
    if output_path.exists():
        return filename, True  # Already exists

    success = download_file(url, output_path, timeout=timeout, show_progress=False)
    return filename, success


def interactive_select_group(groups: Dict[str, Dict]) -> Optional[str]:
    """Interactively select a sample group."""
    if len(groups) <= 1:
        return None  # No selection needed

    print("\n" + "=" * 60)
    print("  SELECT SAMPLE GROUP TO DOWNLOAD")
    print("=" * 60)

    sorted_groups = sorted(groups.items(), key=lambda x: -x[1]['count'])

    for i, (key, info) in enumerate(sorted_groups, 1):
        size_str = format_file_size(info['size_estimate'])
        print(f"\n  [{i}] {info['strategy']} ({info['layout'].lower()})")
        print(f"      Samples: {info['count']}")
        print(f"      GSM: {info['gsm_range']}")
        print(f"      Size: ~{size_str}")

    print(f"\n  [0] Download ALL ({sum(g['count'] for g in groups.values())} samples)")
    print("-" * 60)

    try:
        choice = input(f"\nEnter selection (0-{len(sorted_groups)}): ").strip()
        choice_num = int(choice)

        if choice_num == 0:
            return None  # Download all
        elif 1 <= choice_num <= len(sorted_groups):
            selected_key = sorted_groups[choice_num - 1][0]
            print(f"\n✓ Selected: {selected_key}")
            return selected_key
        else:
            print("Invalid selection, downloading all.")
            return None
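    except (ValueError, EOFError):
        # KeyboardInterrupt is deliberately not caught: Ctrl-C should abort,
        # not silently fall through to downloading everything
        print("\nInvalid input, downloading all.")
        return None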


def cmd_download(args):
    """Download FASTQ files from ENA."""
    geo_id = args.geo_id.upper()
    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"\nPreparing download for {geo_id}...")

    # Get detailed run info (includes BioProject fallback for SuperSeries)
    print("Fetching SRA run information...")
    runs = fetch_sra_run_info_detailed(geo_id)
    if not runs:
        runs = fetch_sra_run_info(geo_id)

    if not runs:
        print(f"❌ No runs found for {geo_id}")
        return 1

    # Collect all unique SRA studies from runs (SuperSeries may have multiple)
    sra_studies = set(r.get('sra_study', '') for r in runs if r.get('sra_study'))
    if not sra_studies:
        print(f"❌ Could not find any SRA studies for {geo_id}")
        return 1

    if len(sra_studies) > 1:
        print(f"SuperSeries detected with {len(sra_studies)} SRA studies: {', '.join(sorted(sra_studies))}")
    else:
        print(f"SRA Study: {list(sra_studies)[0]}")

    # Group samples
    groups = group_samples_by_type(runs)

    # Show sample groups if multiple types exist
    if len(groups) > 1:
        print(format_sample_groups_table(groups))

    # Handle subset selection
    selected_subset = args.subset

    # Interactive mode if multiple groups and no subset specified
    if args.interactive and len(groups) > 1 and not selected_subset:
        selected_subset = interactive_select_group(groups)

    # Get ENA FASTQ URLs from all SRA studies
    print("\nFetching FASTQ URLs from ENA...")
    fastq_urls = {}
    for sra_study in sorted(sra_studies):
        study_urls = fetch_ena_fastq_urls(sra_study)
        if study_urls:
            print(f"  {sra_study}: {len(study_urls)} runs")
            fastq_urls.update(study_urls)

    if not fastq_urls:
        print("❌ No FASTQ URLs found in ENA")
        print("Tip: Try using SRA toolkit directly with prefetch + fasterq-dump")
        return 1

    # Apply filter if specified
    if selected_subset:
        filter_parts = selected_subset.split(':')
        strategy_filter = filter_parts[0].upper() if filter_parts else None
        layout_filter = filter_parts[1].upper() if len(filter_parts) > 1 else None

        filtered_srrs = set()
        for run in runs:
            if strategy_filter and run.get('library_strategy', '').upper() != strategy_filter:
                continue
            if layout_filter and run.get('layout', '').upper() != layout_filter:
                continue
            filtered_srrs.add(run['srr'])

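        fastq_urls = {srr: urls for srr, urls in fastq_urls.items() if srr in filtered_srrs}
        print(f"\n📦 Filtered to {len(fastq_urls)} runs matching \"{selected_subset}\"")
        if not fastq_urls:
            # Without this check an empty match would be reported as "all files downloaded"
            print(f"❌ No runs match subset \"{selected_subset}\"")
            return 1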

    # Count files to download
    total_files = sum(len(urls) for urls in fastq_urls.values())
    print(f"\n📦 Found {len(fastq_urls)} runs, {total_files} FASTQ files to download")

    # Check for existing files
    existing = 0
    downloads_needed = []
    for srr, urls in fastq_urls.items():
        for url in urls:
            filename = url.split('/')[-1]
            filepath = output_dir / filename
            if filepath.exists():
                existing += 1
            else:
                downloads_needed.append((url, filepath))

    if existing:
        print(f"  ✓ {existing} files already exist, skipping")

    if not downloads_needed:
        print("\n✅ All files already downloaded!")
        return 0

    print(f"  ↓ {len(downloads_needed)} files to download")
    print()

    # Download files
    successful = 0
    failed = []

    if args.parallel > 1:
        # Parallel download
        with ThreadPoolExecutor(max_workers=args.parallel) as executor:
            futures = {
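                # Pass the user-supplied timeout so --timeout also applies in parallel mode
                executor.submit(download_fastq_file, url, filepath, args.timeout): filepath
                for url, filepath in downloads_needed
            }

            for i, future in enumerate(as_completed(futures), 1):
                filepath = futures[future]
                try:
                    filename, success = future.result()
                except Exception as exc:
                    # Count an unexpected exception as a failed download instead of aborting the batch
                    logger.warning(f"Download of {filepath.name} raised: {exc}")
                    filename, success = filepath.name, False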
                status = "✓" if success else "✗"
                print(f"  [{i}/{len(downloads_needed)}] {status} {filename}")
                if success:
                    successful += 1
                else:
                    failed.append(filename)
    else:
        # Sequential download
        for i, (url, filepath) in enumerate(downloads_needed, 1):
            filename = filepath.name
            print(f"  [{i}/{len(downloads_needed)}] Downloading {filename}...")
            success = download_file(url, filepath, timeout=args.timeout)
            if success:
                successful += 1
                print("    ✓ Done")
            else:
                failed.append(filename)
                print("    ✗ Failed")

    print("\n📊 Download summary:")
    print(f"  ✓ Successful: {successful + existing}")
    print(f"  ✗ Failed: {len(failed)}")

    if failed:
        print("\nFailed downloads:")
        for f in failed:
            print(f"  - {f}")
        return 1

    print(f"\n✅ All files downloaded to: {output_dir}")

    # Save metadata
    metadata_path = output_dir / "download_metadata.json"
    metadata = {
        'geo_id': geo_id,
        'sra_studies': sorted(sra_studies),
        'n_runs': len(fastq_urls),
        'n_files': total_files,
        'output_dir': str(output_dir.absolute()),
    }
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=2)

    return 0


def cmd_samplesheet(args):
    """Generate samplesheet for nf-core pipeline."""
    geo_id = args.geo_id.upper()
    fastq_dir = Path(args.fastq_dir)
    output_path = Path(args.output)

    print(f"\nGenerating samplesheet for {geo_id}...")

    # Get run info
    runs = fetch_sra_run_info(geo_id)
    if not runs:
        print(f"❌ No runs found for {geo_id}")
        return 1

    # Get GEO metadata for sample naming
    metadata = fetch_geo_metadata(geo_id)
    organism = metadata.get('organism', 'Unknown') if metadata else 'Unknown'
    genome = suggest_genome(organism)

    # Detect pipeline from the most common library strategy
    # (candidates are sorted first so ties resolve deterministically)
    strategies = [r.get('library_strategy') or 'RNA-SEQ' for r in runs]
    primary_strategy = max(sorted(set(strategies)), key=strategies.count)
    pipeline = args.pipeline or suggest_pipeline(primary_strategy)

    # Map SRR to local FASTQ files
    samples = []
    for run in runs:
        srr = run['srr']
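        # Normalize case so 'paired' and 'PAIRED' are treated the same, as in the filters above
        layout = (run.get('layout') or 'PAIRED').upper()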

        # Find FASTQ files
        if layout == 'PAIRED':
            r1 = fastq_dir / f"{srr}_1.fastq.gz"
            r2 = fastq_dir / f"{srr}_2.fastq.gz"
            if not r1.exists() or not r2.exists():
                logger.warning(f"FASTQ files not found for {srr}")
                continue
            samples.append({
                'srr': srr,
                'gsm': run.get('gsm', ''),
                'fastq_1': str(r1.absolute()),
                'fastq_2': str(r2.absolute()),
                'layout': 'PAIRED',
            })
        else:
            r1 = fastq_dir / f"{srr}.fastq.gz"
            if not r1.exists():
                r1 = fastq_dir / f"{srr}_1.fastq.gz"
            if not r1.exists():
                logger.warning(f"FASTQ file not found for {srr}")
                continue
            samples.append({
                'srr': srr,
                'gsm': run.get('gsm', ''),
                'fastq_1': str(r1.absolute()),
                'fastq_2': '',
                'layout': 'SINGLE',
            })

    if not samples:
        print(f"❌ No FASTQ files found in {fastq_dir}")
        return 1

    # Generate sample names
    # Names currently default to the SRR accession; GSM IDs are kept in each
    # sample record but are not used for naming here
    sample_names = {}
    for sample in samples:
        sample_names[sample['srr']] = sample['srr']

    # Write samplesheet
    with open(output_path, 'w') as f:
        if pipeline == 'rnaseq':
            f.write("sample,fastq_1,fastq_2,strandedness\n")
            for sample in samples:
                name = sample_names[sample['srr']]
                f.write(f"{name},{sample['fastq_1']},{sample['fastq_2']},auto\n")
        elif pipeline == 'atacseq':
            f.write("sample,fastq_1,fastq_2,replicate\n")
            # Each run is written as its own sample, so replicate is always 1
            for sample in samples:
                name = sample_names[sample['srr']]
                f.write(f"{name},{sample['fastq_1']},{sample['fastq_2']},1\n")
        else:
            # Generic format
            f.write("sample,fastq_1,fastq_2\n")
            for sample in samples:
                name = sample_names[sample['srr']]
                f.write(f"{name},{sample['fastq_1']},{sample['fastq_2']}\n")

    print(f"\n✅ Generated samplesheet: {output_path}")
    print(f"   Samples: {len(samples)}")
    print(f"   Pipeline: nf-core/{pipeline}")
    if genome:
        print(f"   Genome: {genome}")

    print("\n💡 Suggested command:")
    print(f"   nextflow run nf-core/{pipeline} \\")
    print(f"       --input {output_path} \\")
    print("       --outdir results \\")
    if genome:
        print(f"       --genome {genome} \\")
    print("       -profile docker")

    return 0


def main():
    parser = argparse.ArgumentParser(
        description="Download GEO/SRA data and prepare for nf-core pipelines",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s info GSE110004                    # Get study info with sample groups
  %(prog)s groups GSE110004                  # Show sample groups for selection
  %(prog)s list GSE110004 --filter RNA-Seq   # List RNA-seq runs
  %(prog)s download GSE110004 -o ./fastq -i  # Download with interactive selection
  %(prog)s download GSE110004 -o ./fastq --subset "RNA-Seq:PAIRED"
  %(prog)s samplesheet GSE110004 \\
      --fastq-dir ./fastq -o samplesheet.csv # Generate samplesheet
        """
    )

    subparsers = parser.add_subparsers(dest='command', help='Commands')

    # info command
    info_parser = subparsers.add_parser('info', help='Display study information with sample groups')
    info_parser.add_argument('geo_id', help='GEO accession (e.g., GSE110004)')
    info_parser.add_argument('--output-json', '-o', help='Save info to JSON file')

    # groups command
    groups_parser = subparsers.add_parser('groups', help='Show sample groups for interactive selection')
    groups_parser.add_argument('geo_id', help='GEO accession')
    groups_parser.add_argument('--output', '-o', help='Save groups to JSON file')

    # list command
    list_parser = subparsers.add_parser('list', help='List samples and runs')
    list_parser.add_argument('geo_id', help='GEO accession')
    list_parser.add_argument('--filter', '-f', help='Filter by strategy:layout (e.g., RNA-Seq:PAIRED)')
    list_parser.add_argument('--output', '-o', help='Save to TSV file')

    # download command
    dl_parser = subparsers.add_parser('download', help='Download FASTQ files')
    dl_parser.add_argument('geo_id', help='GEO accession')
    dl_parser.add_argument('--output', '-o', required=True, help='Output directory')
    dl_parser.add_argument('--subset', '-s', help='Filter subset (e.g., RNA-Seq:PAIRED)')
    dl_parser.add_argument('--interactive', '-i', action='store_true',
                           help='Interactively select sample group to download')
    dl_parser.add_argument('--parallel', '-p', type=int, default=4, help='Parallel downloads')
    dl_parser.add_argument('--timeout', '-t', type=int, default=600, help='Download timeout (sec)')

    # samplesheet command
    ss_parser = subparsers.add_parser('samplesheet', help='Generate samplesheet')
    ss_parser.add_argument('geo_id', help='GEO accession')
    ss_parser.add_argument('--fastq-dir', '-f', required=True, help='Directory with FASTQ files')
    ss_parser.add_argument('--output', '-o', default='samplesheet.csv', help='Output samplesheet')
    ss_parser.add_argument('--pipeline', '-p', help='Target pipeline (auto-detected if not specified)')

    args = parser.parse_args()

    if not args.command:
        parser.print_help()
        return 1

    commands = {
        'info': cmd_info,
        'groups': cmd_groups,
        'list': cmd_list,
        'download': cmd_download,
        'samplesheet': cmd_samplesheet,
    }

    return commands[args.command](args)


if __name__ == '__main__':
    sys.exit(main())
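
The `--filter` and `--subset` options in the script above apply the same `strategy:layout` matching logic in two places. As a minimal sketch of that logic pulled into one spot: the helpers `parse_subset` and `filter_runs` are hypothetical names that do not exist in the script, and the run records are made-up placeholders that assume the same `library_strategy` and `layout` keys used above.

```python
from typing import Dict, List, Optional, Tuple


def parse_subset(spec: str) -> Tuple[Optional[str], Optional[str]]:
    """Split 'STRATEGY[:LAYOUT]' into upper-cased parts; empty parts become None."""
    parts = spec.split(':', 1)
    strategy = parts[0].strip().upper() or None
    layout = None
    if len(parts) > 1 and parts[1].strip():
        layout = parts[1].strip().upper()
    return strategy, layout


def filter_runs(runs: List[Dict], spec: str) -> List[Dict]:
    """Keep runs whose strategy (and layout, if given) match the spec, ignoring case."""
    strategy, layout = parse_subset(spec)
    kept = []
    for run in runs:
        if strategy and (run.get('library_strategy') or '').upper() != strategy:
            continue
        if layout and (run.get('layout') or '').upper() != layout:
            continue
        kept.append(run)
    return kept


# Made-up run records for illustration only
example_runs = [
    {'srr': 'SRR0000001', 'library_strategy': 'RNA-Seq', 'layout': 'PAIRED'},
    {'srr': 'SRR0000002', 'library_strategy': 'RNA-Seq', 'layout': 'SINGLE'},
    {'srr': 'SRR0000003', 'library_strategy': 'ATAC-seq', 'layout': 'PAIRED'},
]
selected = [r['srr'] for r in filter_runs(example_runs, 'rna-seq:paired')]
print(selected)  # ['SRR0000001']
```

Sharing one helper between listing and downloading would keep the two commands from drifting apart, though whether that refactor suits the script is a judgment call.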

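The samplesheet writer above builds CSV rows with f-strings, which works as long as no path contains a comma or a quote. A hedged alternative, assuming the same rnaseq column layout (`sample,fastq_1,fastq_2,strandedness`), is to let Python's `csv` module handle quoting. `write_rnaseq_samplesheet` is an illustrative name, not part of the script, and the paths are placeholders.

```python
import csv
import io
from typing import Dict, Iterable, TextIO


def write_rnaseq_samplesheet(samples: Iterable[Dict[str, str]], handle: TextIO) -> int:
    """Write rnaseq-style samplesheet rows and return the number of samples written."""
    writer = csv.writer(handle, lineterminator='\n')
    writer.writerow(['sample', 'fastq_1', 'fastq_2', 'strandedness'])
    count = 0
    for sample in samples:
        # fastq_2 stays empty for single-end runs, mirroring the script above
        writer.writerow([sample['srr'], sample['fastq_1'], sample.get('fastq_2', ''), 'auto'])
        count += 1
    return count


# Placeholder records for illustration only
example_samples = [
    {'srr': 'SRR0000001',
     'fastq_1': '/data/SRR0000001_1.fastq.gz',
     'fastq_2': '/data/SRR0000001_2.fastq.gz'},
    {'srr': 'SRR0000002',
     'fastq_1': '/data/run,2/SRR0000002.fastq.gz',  # comma in the path gets quoted
     'fastq_2': ''},
]
buffer = io.StringIO()
n_written = write_rnaseq_samplesheet(example_samples, buffer)
print(buffer.getvalue(), end='')
```

Writing to any file-like object also makes the function easy to exercise with `io.StringIO` before pointing it at a real file.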
Install this Skill

Skills give your AI agent a consistent, structured approach to this task — better output than a one-off prompt.

npx skills add anthropics/knowledge-work-plugins --skill bio-research

Official Anthropic skill. Need a walkthrough? See the install guide →

Works with

Claude Code Claude.ai

No terminal needed — Claude.ai works by pasting the skill into custom instructions.

Details

Category: Research
License: Apache 2.0
Author: @anthropics
Source file: bio-research/skills/nextflow-development/SKILL.md
