-
Notifications
You must be signed in to change notification settings - Fork 1
Description
实际问题
Issue 1: DRY Violation - Duplicated Atomic Number Mapping Across 20+ Files
文件:
pymultiwfn/io/parsers/wfn.py:229pymultiwfn/io/parsers/molden.py:181pymultiwfn/io/parsers/xyz.py:70pymultiwfn/io/parsers/fchk.py(and 16+ other parser files)
代码 (重复出现在多个文件中):
# In wfn.py:229
symbol_to_number = {
\"H\": 1,
\"He\": 2,
\"Li\": 3,
... # 36 entries
}
# In molden.py:181
symbol_to_number = {
\"H\": 1,
\"HE\": 2,
\"LI\": 3,
... # Same data, different case
}
# In xyz.py:70
element_mapping = {
\"H\": 1,
\"He\": 2,
... # Same data, different variable name
}问题: The same atomic number mapping is copy-pasted across 20+ parser files. This violates DRY (Don't Repeat Yourself) principle and makes maintenance difficult - any update to element data requires changes in multiple places.
改进建议:
# In pymultiwfn/core/definitions.py (ALREADY EXISTS!)
ELEMENT_NAME_TO_Z = {
name.strip().upper(): i
for i, name in enumerate(ELEMENT_NAMES)
if name.strip() != "?? "
}
# In parsers, simply import and use:
from pymultiwfn.core.definitions import ELEMENT_NAME_TO_Z
# Single lookup function for all parsers
def get_atomic_number(symbol: str) -> int:
\"\"\"Get atomic number from element symbol (case-insensitive).\"\"\"
return ELEMENT_NAME_TO_Z.get(symbol.strip().upper(), 0)Issue 2: Overly Complex ParserFactory._detect_from_content Method
文件: pymultiwfn/io/parsers/factory.py:146-268
代码 (节选):
@classmethod
def _detect_from_content(cls, filename: str) -> Optional[Type]:
# ... 100+ lines of repetitive detection logic ...
# Check for Gaussian formatted checkpoint
if \"Gaussian\" in content and \"Formatted Checkpoint\" in content:
return FchkLoader
# Check for Molden format
if \"[Molden Format]\" in content or \"[Atoms]\" in content:
return MoldenLoader
# Check for WFN format
if any(line and line.split()[0].isdigit() for line in first_lines if line):
# ... 10 lines of nested logic ...
return WFNLoader
# ... 15+ more similar checks ...
# Check for PDB format
if any(line.startswith((\"ATOM\", \"HETATM\")) for line in first_lines):
return PDBLoader
# Check for Cube format
if len(first_lines) >= 2 and first_lines[1].strip().isdigit():
# ... 5 lines of nested logic ...
return CubeLoader
# ... and so on for 20+ formats问题: The content detection logic is verbose and repetitive. Each format check follows the same pattern but is implemented with slight variations, making the code hard to read and extend.
改进建议:
# Simple, declarative format detection
FORMAT_SIGNATURES = [
(FchkLoader, lambda lines, content:
\"Gaussian\" in content and \"Formatted Checkpoint\" in content),
(MoldenLoader, lambda lines, content:
\"[Molden Format]\" in content or \"[Atoms]\" in content),
(WFNLoader, lambda lines, content:
_is_wfn_format(lines)),
(PDBLoader, lambda lines, content:
any(l.startswith((\"ATOM\", \"HETATM\")) for l in lines)),
# ... etc
]
def _detect_from_content(cls, filename: str) -> Optional[Type]:
first_lines = _read_first_lines(filename)
content = \"\\n\".join(first_lines)
for loader_class, check_fn in FORMAT_SIGNATURES:
if check_fn(first_lines, content):
return loader_class
return NoneIssue 3: Code Duplication in Bond Order Calculations
文件: pymultiwfn/analysis/bonding/bondorder.py
代码 (calculate_mayer_bond_order and calculate_mulliken_bond_order):
def calculate_mayer_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
# ... validation code ...
for i in range(n_atoms):
bfs_i = atom_to_bfs.get(i, [])
if not bfs_i:
continue
for j in range(i + 1, n_atoms):
bfs_j = atom_to_bfs.get(j, [])
if not bfs_j:
continue
# ... calculation ...
def calculate_mulliken_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
# ... validation code ...
for i in range(n_atoms):
bfs_i = atom_to_bfs.get(i, [])
if not bfs_i:
continue
for j in range(i + 1, n_atoms):
bfs_j = atom_to_bfs.get(j, [])
if not bfs_j:
continue
# ... calculation ...问题: The atom pair iteration pattern is duplicated between Mayer and Mulliken bond order calculations. The only difference is the actual computation inside the inner loop.
改进建议:
def _iterate_atom_pairs(wfn: Wavefunction):
\"\"\"Generator to yield (i, j, bfs_i, bfs_j) for atom pairs.\"\"\"
atom_to_bfs = wfn.get_atomic_basis_indices()
n_atoms = wfn.num_atoms
for i in range(n_atoms):
bfs_i = atom_to_bfs.get(i, [])
if not bfs_i:
continue
for j in range(i + 1, n_atoms):
bfs_j = atom_to_bfs.get(j, [])
if not bfs_j:
continue
yield i, j, bfs_i, bfs_j
def calculate_mayer_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
PS_total = wfn.Ptot @ wfn.overlap_matrix
for i, j, bfs_i, bfs_j in _iterate_atom_pairs(wfn):
ps_ij = PS_total[np.ix_(bfs_i, bfs_j)]
ps_ji = PS_total[np.ix_(bfs_j, bfs_i)]
accum = np.trace(ps_ij @ ps_ji)
# ... store result ...
def calculate_mulliken_bond_order(wfn: Wavefunction) -> Dict[str, np.ndarray]:
PS_total = wfn.Ptot * wfn.overlap_matrix
for i, j, bfs_i, bfs_j in _iterate_atom_pairs(wfn):
bnd_total[i, j] = np.sum(PS_total[np.ix_(bfs_i, bfs_j)])Issue 4: Massive if-elif Chain in overlap.py
文件: pymultiwfn/integrals/overlap.py:388-441
代码:
def _type_to_lmn(gto_type: int) -> Tuple[int, int, int]:
# S and P functions
if gto_type == 0:
return (0, 0, 0)
elif gto_type == 1:
return (1, 0, 0)
elif gto_type == 2:
return (0, 1, 0)
elif gto_type == 3:
return (0, 0, 1)
# D functions
elif gto_type == 4:
return (2, 0, 0)
elif gto_type == 5:
return (0, 2, 0)
elif gto_type == 6:
return (0, 0, 2)
elif gto_type == 7:
return (1, 1, 0)
elif gto_type == 8:
return (1, 0, 1)
elif gto_type == 9:
return (0, 1, 1)
# ... more elifs for F functions
else:
raise NotImplementedError(f\"GTO type {gto_type} not yet implemented\")问题: 20+ lines of if-elif chains for a simple lookup operation. This is harder to read and maintain than a lookup table.
改进建议:
# Single lookup table - easy to read, easy to maintain
GTO_TYPE_TO_LMN = {
0: (0, 0, 0), # S
1: (1, 0, 0), # Px
2: (0, 1, 0), # Py
3: (0, 0, 1), # Pz
4: (2, 0, 0), # Dxx
5: (0, 2, 0), # Dyy
6: (0, 0, 2), # Dzz
7: (1, 1, 0), # Dxy
8: (1, 0, 1), # Dxz
9: (0, 1, 1), # Dyz
# ... etc
}
def _type_to_lmn(gto_type: int) -> Tuple[int, int, int]:
try:
return GTO_TYPE_TO_LMN[gto_type]
except KeyError:
raise NotImplementedError(f\"GTO type {gto_type} not yet implemented\")Issue 5: Complex WFN Loader with Over-Engineered Fallback Logic
文件: pymultiwfn/io/parsers/wfn.py:405-587
代码:
def _parse_basis_and_mo(self, lines):
# ... 180+ lines of complex parsing logic ...
# Multiple nested data structures
shells_dict = {}
# Extensive debug print statements mixed with logic
print(f\"[DEBUG] Basis function information...\")
# Complex type mapping with multiple conversion layers
if shell_type == 1:
our_type = 0 # S
elif shell_type in [2, 3, 4]:
our_type = 1 # P
elif shell_type in [5, 6, 7, 8, 9, 10]:
our_type = 2 # D
# ... more conversions
# Multiple fallback strategies
try:
# Try full overlap calculation
overlap_matrix = self._calculate_wfn_overlap_matrix()
except Exception as e:
warnings.warn(f\"Full overlap calculation failed: {e}\")
# Fallback to distance-based
overlap_matrix = self._calculate_distance_based_overlap(basis_functions)问题: The WFN parser is unnecessarily complex with debug prints mixed with logic, multiple fallback layers, and convoluted type conversions. This makes debugging difficult.
改进建议:
- Separate parsing logic from debug output
- Simplify type conversions using lookup tables
- Remove or consolidate fallback strategies
- Add clear separation between parsing and post-processing
为什么重要
| Issue | Impact |
|---|---|
| DRY Violation | Maintenance burden - updating element data requires changes in 20+ files; risk of inconsistencies |
| Complex Factory | Hard to add new formats; error-prone; difficult to test individual format detection |
| Duplicated Bond Logic | Code bloat; harder to optimize; inconsistent behavior risk |
| if-elif Chain | Slower execution; harder to extend for new GTO types |
| Over-Engineered WFN | Harder to debug; unclear data flow; technical debt |
引用 KISS 原则: "Simplicity is the ultimate sophistication." - Leonardo da Vinci
These issues violate the KISS principle by introducing unnecessary complexity where simple solutions exist.
优先级
- 高 - 影响系统稳定性/安全性 (DRY violations can cause data inconsistencies)
- 中 - 影响代码质量 (All issues)
- 低 - 改进建议
建议修复顺序
- Issue 1 (DRY): Centralize atomic number lookup using existing
ELEMENT_NAME_TO_Z - Issue 4 (if-elif): Convert to lookup table - simplest change with clear benefit
- Issue 3 (Bond orders): Extract common iteration pattern
- Issue 2 (Factory): Refactor to declarative format signatures
- Issue 5 (WFN): Gradual simplification of parsing logic