Alternative attachment filter for the Python filtering architecture of Courier MTA

Courier-MTA has a generic filtering interface, different from sendmail's milter. Among the available software there is a Python-based architecture, which you can find on PyPI. It includes an attachments.py, to which the filter presented here is an alternative. That architecture must be installed before trying this filter.

Differences and other dependencies

The original attachments.py uses libarchive-c if present, but doesn't require it. The present filter does. (Beware of another Python package with a similar name: its namespace collides with libarchive, so it is hard to tell which one is installed.) In addition, this filter requires oletools, a Python library for analyzing OLE and MS Office files. Both are listed in requirements.txt.
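A quick way to tell which libarchive is installed is to look for the function this filter actually calls. The sketch below is only a heuristic: it assumes that libarchive-c is the package exposing memory_reader() (the source below relies on it) and that the colliding package does not, which is not verified here. The function names are hypothetical.

```python
import importlib


def looks_like_libarchive_c(module):
    """Heuristic: the filter calls libarchive.memory_reader(), so require it."""
    return callable(getattr(module, 'memory_reader', None))


def describe_libarchive():
    """Import whatever answers to the name 'libarchive' and describe it."""
    try:
        module = importlib.import_module('libarchive')
    except (ImportError, OSError) as e:
        # OSError can occur when the underlying shared library is missing
        return 'libarchive cannot be imported: %s' % e
    if looks_like_libarchive_c(module):
        return 'libarchive provides memory_reader(), as this filter expects'
    return 'libarchive is importable but lacks memory_reader(): wrong package?'
```

Calling print(describe_libarchive()) with the same interpreter that runs pythonfilter reports which case applies.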

This filter won't block messages addressed exclusively to abuse@ mailboxes, so that abuse teams can receive complaints. You may want to alter that (search for can_pass), although editing the source is not quite a trendy way to maintain software; see below.
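The rule can be sketched in isolation, mirroring the Recipients class in the source below: recipients are the control-file records starting with 'r', and the message passes only if there is at least one recipient and every one is exempt. The function name and the exempt parameter are hypothetical, added to show how the rule might be altered; the filter itself hard-codes abuse@.

```python
def only_exempt_recipients(control_lines, exempt=('abuse@',)):
    """Return True if every recipient record starts with an exempt local part.

    control_lines: iterable of control-file lines; recipient records
    begin with 'r' followed by the address.
    exempt: tuple of lower-case prefixes (hypothetical parameter).
    """
    rcpt_count = 0
    rcpt_exempt = 0
    for line in control_lines:
        if line.startswith('r'):
            rcpt_count += 1
            if line[1:].lower().startswith(exempt):
                rcpt_exempt += 1
    # an empty recipient list never passes
    return rcpt_count > 0 and rcpt_exempt == rcpt_count
```

For example, passing exempt=('abuse@', 'postmaster@') would let messages addressed only to those two role mailboxes through.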

Install

Careful with the pip3 install below, as it will try to install courier-pythonfilter. If that is already installed and you're in a virtualenv, you don't need sudo.

sudo pip3 install -r "http://www.tana.it/svn/pyfilters/trunk/requirements.txt"

curl -O "http://www.tana.it/svn/pyfilters/trunk/attachments3.py"
python3 -m compileall -l "attachments3.py"
sudo mv -i __pycache__/* /usr/local/lib/python3.7/dist-packages/pythonfilter/__pycache__

sudo courierfilter stop
sudo courierfilter start

Check that the /python3.7/ destination directory is correct for your system! If you'd like to consider a Pythonic install, please see below.
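One way to find the destination is to ask Python where the package lives. This is a minimal sketch, assuming the package is importable as pythonfilter (the directory name used in the mv command above) by the same interpreter that compiled the file; pycache_dir is a hypothetical helper name.

```python
import importlib.util
import os


def pycache_dir(package_name):
    """Return the __pycache__ directory of an installed package, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        # not installed, or a plain module rather than a package
        return None
    return os.path.join(list(spec.submodule_search_locations)[0], '__pycache__')


if __name__ == '__main__':
    # prints None if pythonfilter is not visible to this interpreter
    print(pycache_dir('pythonfilter'))
```

If the printed path differs from the one in the mv command, use the printed one.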

Source

#!/usr/bin/python3
" attachments -- Courier filter which blocks specified attachment types"
# Copyright (C) 2005-2008  Robert Penz <robert@penz.name>
# hacked (H) 2017-2021 ale
#

import sys
from email.message import EmailMessage, _unquotevalue
import email.utils
from email.policy import EmailPolicy
from email.headerregistry import HeaderRegistry
import binascii

# this is libarchive-c
import libarchive

import oletools.olevba
from oletools.mraptor import MacroRaptor
from oletools import rtfobj

# added 20 Apr 2020
from oletools.ooxml import XmlParser
from oletools.oleobj import find_external_relationships
from zipfile import is_zipfile

from io import BytesIO


# for debugging:
import traceback

# Extensions.  Assume any extension appears in at most one list.
# Each list has a different treatment.
# Maintain:
# $ i=0; for e in $(sort < temp |uniq ); do printf " '%s'," $e; if [ $((++i % 8)) -eq 0 ]; then printf "\n"; fi; done; printf "\n"

# https://support.google.com/mail/answer/6590?hl=en
# http://www.theverge.com/2017/1/25/14391462/gmail-javascript-block-file-attachments-malware-security
# https://kb.intermedia.net/Article/23567

blocked_extensions = (
 '.acc', '.ade', '.adp', '.asp', '.bat', '.ccs', '.chm', '.class',
 '.cmd', '.com', '.cpl', '.dll', '.dmg', '.drv', '.exe', '.grp',
 '.hlp', '.hta', '.htx', '.ins', '.isp', '.jar', '.je', '.js',
 '.jse', '.lib', '.lnk', '.mde', '.msc', '.msh', '.msh1', '.msh1xml',
 '.msh2', '.msh2xml', '.mshxml', '.msi', '.msp', '.mst', '.ocx', '.ovl',
 '.pcd', '.php', '.php3', '.pif', '.ps1', '.ps1xml', '.ps2', '.ps2xml',
 '.psc1', '.psc2', '.reg', '.sbs', '.scr', '.sct', '.shb', '.shd',
 '.shs', '.sys', '.vb', '.vba', '.vbe', '.vbs', '.vdl', '.vxd',
 '.ws', '.wsc', '.wsf', '.wsh', '.wst')


# extensions supported by VBA_parser, see also
# https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions
# https://datatypes.net/open-ade-files
# https://docs.microsoft.com/en-us/deployoffice/security/block-specific-file-format-types-in-office
# office_extensions = (
#  '.doc', '.dot',   #- Word 97-2003
#  '.docm', '.dotm', #- Word 2007+
#  '.xml',           #- Word 2003 XML
#  '.mht',           #- Word MHT - Single File Web Page / MHTML
#  '.xls',           #- Excel 97-2003
#  '.xlsm', '.xlsb', #- Excel 2007+
#  '.ppt',           #- PowerPoint 97-2003
#  '.pptm', '.ppsm') #- PowerPoint 2007+

office_extensions = (
 '.accda', '.accdb', '.accde', '.accdr', '.accdt', '.ade', '.adn', '.adp',
 '.cdb', '.doc', '.docb', '.docm', '.docx', '.dot', '.dotm', '.dotx',
 '.htm', '.html', '.laccdb', '.ldb', '.maf', '.mam', '.maq', '.mar',
 '.mat', '.mda', '.mdb', '.mde', '.mdf', '.mdn', '.mdt', '.mdw',
 '.mht', '.mhtml', '.ods', '.pot', '.potm', '.potx', '.ppam', '.ppax',
 '.pps', '.ppsm', '.ppsx', '.ppt', '.pptm', '.pptx', '.rtf', '.sldm',
 '.sldx', '.thmx', '.wbk', '.wiz', '.xla', '.xlam', '.xlb', '.xlcxlk',
 '.xll', '.xlm', '.xlmss', '.xls', '.xlsb', '.xlsm', '.xlsx', '.xlt',
 '.xltm', '.xltx', '.xlw')


# extensions implemented as a zip container have a variety of media
# files, but macros are still implemented as OLE containers.
# See 'Heuristic' below.
# https://www.codeproject.com/Articles/15216/Office-2007-bin-file-format
# https://kb.intermedia.net/Article/23567



# https://en.wikipedia.org/wiki/List_of_archive_formats but must be in
# https://github.com/libarchive/libarchive/wiki/ManPageLibarchiveFormats5
# File names are lower-cased before matching, hence '.tar.z' rather than '.tar.Z'.
archive_extensions = ('.zip', '.tar.gz', '.tgz', '.tar.z', '.tar.bz2',
   '.tbz2', '.tar.lzma', '.tlz', '.7z', '.ace', '.rar')


# TODO: detect documents with TargetMode="External" and DDE, see:
# http://staaldraad.github.io/2017/10/23/msword-field-codes/


def de_comment(field):
   """Parse a header field fragment and remove comments.

   copied from AddrlistClass.getdelimited() in email/_parseaddr.py
   """

   slist = ['']
   quote = False
   pos = 0
   depth = 0
   while pos < len(field):
      if quote:
         quote = False
      elif field[pos] == '(':
         depth += 1
      elif field[pos] == ')':
         depth = max(depth - 1, 0)
         pos += 1
         continue
      elif field[pos] == '\\':
         quote = True
      if depth == 0:
         slist.append(field[pos])
      pos += 1

   return ''.join(slist)

def is_quoted(value):
   """ Check whether a value (string or tuple) is quoted
   """
   if isinstance(value, tuple):
      return value[2].startswith('"')
   else:
      return value.startswith('"')

class Recipients(object):
   def __init__(self, controlFileList=None, *args, **kwargs):
      object.__init__(self, *args, **kwargs)
      self.rcpt_count = 0
      self.rcpt_abuse = 0
      self.relay = False

      # controlFileList defaults to None: treat that as "no control files"
      for cf in controlFileList or []:
         with open(cf) as fp:
            for line in fp:
               if line[0] == 'r':
                  self.rcpt_count += 1
                  if line[1:7].lower() == 'abuse@':
                     self.rcpt_abuse += 1
               elif line[0] == 'u':
                  if line[1:9] == 'authsmtp':
                     self.relay = True

   def can_pass(self):
      "Return true if the only recipient(s) are RFC2142 abuse-mailbox(es)"
      return self.rcpt_count > 0 and self.rcpt_abuse == self.rcpt_count

# Python 3.7.3:
# Use unstructured fields by default.  Structured ContentTypeHeader
# parses fields too cleverly, so that the following, found in the wild:
#
#     Content-Type: application/octet-stream; name=3D"198646.zip"
#
# becomes:
#
#     Content-Type: application/octet-stream; name=3D
#
# That way a potential threat can get away without being uncompressed.
#
# Must pay attention to potential API changes in this area.
myemailpolicy = EmailPolicy(header_factory=HeaderRegistry(use_default_map=False))

class MyMessage(EmailMessage):
   """Email message with comments stripped
   """
   def __init__(self, *args, **kwargs):
      EmailMessage.__init__(self, *args, **kwargs)

   def get_filename(self, failobj=None):
      """Return the filename associated with the payload if present.

      The filename is extracted from the Content-Disposition header's
      `filename' parameter.  If that header is missing the `filename'
      parameter, this method falls back to looking for the `name' parameter.
      """
      # changed from original: get the unquoted string
      missing = object()
      filename = self.get_param('filename', missing, 'content-disposition',
         unquote=False)
      if filename is missing:
         filename = self.get_param('name', missing, 'content-type', unquote=False)
      if filename is missing:
         return failobj

      # added to original: non quoted comments are removed
      bare = is_quoted(filename)
      if not bare:
         filename = _unquotevalue(filename)
      filename = email.utils.collapse_rfc2231_value(filename)
      if bare and '(' in filename:
         filename = de_comment(filename)
      # malformed values, e.g. name=3D"blah", we only remove trailing char
      while filename.endswith(('"', "'", '>', ',', ';')):
         filename = filename[0:-1]
      return filename.strip().lower()

def reader_entry(which):
   # print('Entered', which, 'reader')
   pass

def check_message(msg):
   for part in msg.walk():
      # reader of attached email message
      def mail_reader():
         reader_entry('mail')
         cte = str(part.get('content-transfer-encoding', '')).lower()
         if cte in ('quoted-printable', 'base64', 'x-uuencode', 'uuencode', 'uue', 'x-uue'):
            payload = part.get_payload(decode=True)
         else: # 7bit, 8bit
            payload = part.get_payload(decode=False)
            if part.is_multipart():
               return payload[0].as_string().encode()
            # When is_multipart() returns False, the payload should be a string object.
            # https://docs.python.org/release/3.7.3/library/email.message.html#email.message.EmailMessage.is_multipart
         if isinstance(payload, str):
            # bytes(str) needs an encoding; surrogateescape restores raw 8-bit bytes
            payload = payload.encode('utf-8', 'surrogateescape')
         return bytes(payload)

      # multipart/* are just containers
      if part.get_content_maintype() == 'multipart':
         continue

      if part.get_content_type() == 'message/rfc822':
         inner_msg = email.message_from_bytes(mail_reader(),
            policy=myemailpolicy, _class=MyMessage)
         # a clean attached message must not stop the scan of the remaining parts
         if check_message(inner_msg):
            return True
         continue

      # get_filename() is in MyMessage
      filename = part.get_filename()
      if filename:
         # print part.get_content_type(), filename
         if block_file(filename, mail_reader):
            return True

   return False

def block_ole_file(filename, data):
   try:
      # Macros
      parser = oletools.olevba.VBA_Parser(BytesIO(data), data=data, relaxed=True)
      # Heuristic: if an OpenXML contains an OLE container, it is suspicious
      if parser.type == 'OpenXML':
         if len(parser.ole_subfiles) > 0:
            sys.stderr.write('attachments OpenXML contains an OLE container\n')
            return True
      if parser.detect_vba_macros():
         vba_code_all = ''
         for (subfilename, stream_path, vba_filename, vba_code) in parser.extract_macros():
            vba_code_all += vba_code + '\n'
         mraptor = MacroRaptor(vba_code_all)
         mraptor.scan()
         if mraptor.suspicious:
            sys.stderr.write('attachments Found mraptor.suspicious\n')
            return True

      # External stuff
      # DISABLED on 09 Mar 2022
      #filedata = BytesIO(data)
      #if is_zipfile(filedata):
      #  xml_parser = XmlParser(filedata)
      #  # This does not catch http://schemas.openxmlformats.org/officeDocument/2006/relationships/image
      #  # but instead catches simple, non-autoloading links
      #  for relationship, target in find_external_relationships(xml_parser):
      #     if not target.startswith('file:'):
      #        sys.stderr.write('attachments ' + "Found relationship '%s' with external link %s" % (relationship, target) + '\n')
      #        return True # one is enough

   except oletools.olevba.FileOpenError as e:
      sys.stderr.write('attachments FileOpenError: ' + str(e) + '\n')
   except oletools.ooxml.BadOOXML as e:
      sys.stderr.write('attachments: ' + str(e) + '\n')
   # nothing suspicious found, or the file could not be parsed
   return False

def block_file(filename, reader):
   """
      Check if a file should be blocked, either because of its extension
      or its content.  If content must be examined, the reader is called.
      For Python3, the reader returns bytes.
      filename must be defined and lower().strip()
      Return True if blocking is deserved.
   """
   # print('block_file', filename)
   if filename.endswith(blocked_extensions):
      sys.stderr.write('attachments Blocked extension: ' + filename +'\n')
      return True

   if filename.endswith(archive_extensions):
      # print filename
      try:
         zmem = reader()
         with libarchive.memory_reader(zmem) as archive:
            for entry in archive:
               def archive_reader():
                  reader_entry('archive')
                  mem = bytearray()
                  for block in entry.get_blocks():
                     mem += bytearray(block)
                  return bytes(mem)

               # entry names are not normalized yet, see the docstring
               if block_file(entry.pathname.strip().lower(), archive_reader):
                  return True
      except libarchive.exception.ArchiveError as e:
         if e.retcode == libarchive.ffi.ARCHIVE_FATAL:
            # Unrecognized archive format, e.g. rar v5
            sys.stderr.write('attachments Unrecognized archive format: ' + filename +'\n')
            return True

   elif filename.endswith(".gz"):
      def gunzip_reader():
         reader_entry('gunzip')
         # plain locals: attributes cannot be set on a bare object() instance
         mem = bytearray()
         entries = 0
         with libarchive.memory_reader(reader(),
               format_name='raw', filter_name='gzip') as archive:
            for entry in archive:
               entries += 1
               if entries != 1 or entry.size is not None:
                  raise ValueError('Invalid gzip format')
               for block in entry.get_blocks():
                  # bytearray.append() takes a single int, so extend instead
                  mem += bytearray(block)
         return bytes(mem)
      return block_file(filename[:-3], gunzip_reader)

   elif filename.endswith(office_extensions):
      data = reader()
      if oletools.rtfobj.is_rtf(data, treat_str_as_data=True):
         rtfp = oletools.rtfobj.RtfObjParser(data)
         rtfp.parse()
         # named obj so as not to shadow the imported rtfobj module
         for obj in rtfp.objects:
            if obj.is_ole:
               if obj.oledata_size is None:
                  # format_id=TYPE_LINKED?
                  return True
               elif block_ole_file(filename, obj.oledata):
                  return True
            elif obj.is_package:
               sys.stderr.write('attachments Found RTF package\n')
               return True
      else:
         return block_ole_file(filename, data)
   elif filename.endswith('.eml'):
      msg = email.message_from_bytes(reader(),
         policy=myemailpolicy, _class=MyMessage)
      return check_message(msg)
   return False

def doFilter(bodyFile, controlFileList):
   "Function called by Pythonfilter"
   try:
      # read raw bytes: a text-mode open can fail on undeclared 8-bit content
      with open(bodyFile, 'rb') as fp:
         msg = email.message_from_binary_file(fp,
            policy=myemailpolicy, _class=MyMessage)
      block = check_message(msg)
      if block:
         rcpts = Recipients(controlFileList)
         if rcpts.can_pass():
            return ''

         return "550 Attachment rejected for policy reasons"

   except Exception as e:
      sys.stderr.write('attachments ' + type(e).__name__ + ': ' + str(e) + '\n')
      # print(traceback.format_exc())
   # nothing found --> to the next filter
   return ''
396: 
397: if __name__ == '__main__':
398:    # For debugging, you can create a file that contains a message
399:    # body, possibly including attachments.
400:    # Run this script with the name of that file as an argument,
401:    # and it'll print either a permanent failure code to indicate
402:    # that the message would be rejected, or print nothing to
403:    # indicate that the remaining filters would be run.
404:    if len(sys.argv) != 2:
405:       print("Usage: attachments.py <message_body_file>")
406:       sys.exit(0)
407:    re = doFilter(sys.argv[1], [])
408:    if (re == ''):
409:       re = '(empty string)'
410:    print(re)
411: 
GPLv3
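The debugging entry point at the end of the source expects a file containing a message. A minimal sketch for producing one with the standard library follows; the addresses, the attachment name and both function names are made up for illustration.

```python
from email.message import EmailMessage


def build_test_message(attachment_name='sample.exe'):
    """Build a small message carrying a harmless dummy attachment."""
    msg = EmailMessage()
    msg['From'] = 'sender@example.com'
    msg['To'] = 'rcpt@example.com'
    msg['Subject'] = 'attachment filter test'
    msg.set_content('This message carries a harmless dummy attachment.\n')
    # bytes content is base64-encoded and marked as an attachment
    msg.add_attachment(b'dummy payload', maintype='application',
                       subtype='octet-stream', filename=attachment_name)
    return msg


def write_test_message(path, attachment_name='sample.exe'):
    """Write the test message to path, ready to be fed to the filter."""
    with open(path, 'wb') as fp:
        fp.write(build_test_message(attachment_name).as_bytes())
```

After write_test_message('test.eml'), running python3 attachments3.py test.eml should print the 550 rejection line, since .exe is in blocked_extensions, provided libarchive-c and oletools are importable.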

Pythonic installation, anyone?

While courier-pythonfilter now has a pip installer, this filter is still rustic. If you have better ideas, please write to the list.