Changes between Version 1 and Version 2 of Ticket #6647, comment 23


Timestamp:
05/07/2018 03:52:34 PM (14 months ago)
Author:
sebastian
  • Ticket #6647, comment 23

    v1 v2  
Awesome, this seems to be all the more reason to expect domains in filters to use punycode.
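
To illustrate what "using punycode" means for a domain, here is a minimal sketch. It assumes Python's built-in `idna` codec (which implements IDNA 2003); `to_punycode` is a hypothetical helper name chosen for this example, and `bücher.example` is just a sample domain.

```python
def to_punycode(domain):
    """Return the ASCII (punycode) form of a domain.

    Falls back to the unchanged input if the codec rejects it.
    """
    try:
        return domain.encode('idna').decode('ascii')
    except UnicodeError:
        return domain


# A Unicode domain is rewritten label by label with the xn-- prefix.
print(to_punycode('bücher.example'))  # xn--bcher-kva.example

# A plain ASCII domain passes through unchanged.
print(to_punycode('example.com'))     # example.com
```

A filter written against the punycode form matches the hostname as it appears on the wire, which is presumably why the ASCII form is the safer thing to store in a list.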
    22 
Removed (v1): I hacked together a Python script that performs the conversion on any filter list:

{{{
import sys
import re
import io

def idna(s):
    # Return the punycode (ASCII) form of a domain, or the input
    # unchanged if it cannot be encoded.
    try:
        return s.encode('idna').decode('ascii')
    except UnicodeError:
        return s

def convert_domains(domains):
    # Convert each domain, keeping a leading '~' (exception marker).
    # A Unicode entry is dropped if its punycode form is already listed.
    result = []
    for domain in domains:
        if domain.startswith('~'):
            converted = '~' + idna(domain[1:])
        else:
            converted = idna(domain)
        if domain == converted or converted not in domains:
            result.append(converted)
    return result

def strip(s):
    # Remove all whitespace so lines can be compared loosely.
    return re.sub(r'\s', '', s)

for filename in sys.argv[1:]:
    # Each filter list is rewritten in place.
    with io.open(filename, 'r+', encoding='utf-8') as file:
        lines = file.readlines()
        i = 0
        while i < len(lines):
            line = lines[i]
            # Skip comment lines.
            if not line.lstrip().startswith('!'):
                # Element hiding filters: domains precede ##, #@# or #?#.
                m = re.search(r'^(.*?)#[@?]?#', line)
                if m:
                    domains = ','.join(convert_domains(m.group(1).split(',')))
                    converted = domains + line[m.end(1):]
                else:
                    # Blocking filters: convert the leading host part ...
                    converted = re.sub(
                        r'^([\s@|]*)([^*/^#$\r\n]+)',
                        lambda m: m.group(1) + idna(m.group(2)),
                        line
                    )
                    # ... and the domains listed in a $domain= option.
                    converted = re.sub(
                        r'(\$(?:\w*,)?domain=)([^,\r\n]+)',
                        lambda m: m.group(1) + '|'.join(convert_domains(m.group(2).split('|'))),
                        converted
                    )

                # Drop the line if its converted form already exists in the list.
                if converted != line and strip(converted) in map(strip, lines):
                    del lines[i]
                    continue

                lines[i] = converted
            i += 1

        # Overwrite the file with the converted lines.
        file.seek(0)
        file.writelines(lines)
        file.truncate()
}}}
Added (v2): I hacked together a [attachment:convert_domains.py Python script] that performs the conversion on any filter list.

I uploaded diffs with changes produced by the script for [attachment:easylist.txt.diff EasyList] and [attachment:advblock.txt.diff RU AdList]. For filters/domains that are given in both punycode and Unicode, the script just removed the Unicode variant.
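
The "removed the Unicode variant" behaviour can be sketched on a plain list of domains. This is a simplified illustration under stated assumptions, not the attached script itself: `dedupe_domains` and `idna` are names used only for this example, and the sample domains are made up.

```python
def idna(s):
    """Punycode form of a domain, or the input unchanged on failure."""
    try:
        return s.encode('idna').decode('ascii')
    except UnicodeError:
        return s


def dedupe_domains(domains):
    """Convert domains to punycode and drop Unicode duplicates.

    A Unicode entry is skipped when its punycode form is already
    present in the same list, so only the punycode entry remains.
    A leading '~' (exception marker) is preserved.
    """
    result = []
    for domain in domains:
        prefix = '~' if domain.startswith('~') else ''
        converted = prefix + idna(domain[len(prefix):])
        if domain == converted or converted not in domains:
            result.append(converted)
    return result


# Both spellings are listed: only the punycode one survives.
print(dedupe_domains(['bücher.example', 'xn--bcher-kva.example', 'example.com']))
# ['xn--bcher-kva.example', 'example.com']

# Only the Unicode spelling is listed: it is converted in place.
print(dedupe_domains(['~bücher.example']))
# ['~xn--bcher-kva.example']
```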