Alternative attachment filter for the Python filtering architecture of Courier MTA

Courier-MTA has a generic filtering interface that differs from sendmail's milter. Several implementations of it exist, including courier-pythonfilter, which is also available on PyPI. It includes the attachments.py filter, to which the filter presented here is an alternative. You must install courier-pythonfilter before trying this one.

Differences and other dependencies

The original attachments.py uses libarchive-c if present, but doesn't require it. The present filter does. (Beware of a Python package with a similar name: it installs under the same libarchive namespace, so it is hard to tell the two apart once installed.) In addition, this filter requires oletools, a Python library to analyze OLE and MS Office files. Both are listed in requirements.txt.
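Since the two packages share the libarchive namespace, one rough way to tell them apart is to check for the names this filter actually uses. The sketch below is only a heuristic: the helper names looks_like_libarchive_c and check_installed are made up for illustration, and the assumption is that a colliding package does not expose all three names.

```python
import importlib


def looks_like_libarchive_c(module):
    """Return True if `module` exposes the names the filter relies on."""
    # Names taken from the filter's source: libarchive.memory_reader,
    # libarchive.exception.ArchiveError and libarchive.ffi.ARCHIVE_FATAL.
    if not hasattr(module, 'memory_reader'):
        return False
    exception = getattr(module, 'exception', None)
    ffi = getattr(module, 'ffi', None)
    return hasattr(exception, 'ArchiveError') and hasattr(ffi, 'ARCHIVE_FATAL')


def check_installed(name='libarchive'):
    """Import `name` and report whether it looks like libarchive-c."""
    try:
        module = importlib.import_module(name)
    except Exception:
        # ImportError if nothing is installed; other errors are possible
        # if the underlying C library cannot be loaded.
        return False
    return looks_like_libarchive_c(module)
```

Calling check_installed() in the same Python environment that runs pythonfilter should return True when libarchive-c is the package that got imported.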

The filter won't block messages destined exclusively to abuse@ mailboxes. That's meant to let abuse teams receive complaints. You may want to change that behaviour (search for can_pass); editing the source is admittedly not a fashionable way to maintain software, see below.
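As an illustration of what changing can_pass could look like, here is a minimal sketch of the same test with a configurable list of exempt mailbox prefixes. It is not part of the filter: the function name only_exempt_recipients and the exempt parameter are hypothetical. The only thing taken from the source is that recipient records in the control file start with 'r'.

```python
def only_exempt_recipients(control_lines, exempt=('abuse@',)):
    """Return True if every recipient starts with one of the exempt prefixes.

    `control_lines` is an iterable of control file lines; `exempt` is a
    sequence of mailbox prefixes (hypothetical parameter, for illustration).
    """
    prefixes = tuple(p.lower() for p in exempt)
    total = 0
    matched = 0
    for line in control_lines:
        # As in the filter's source, recipient records start with 'r'.
        if line.startswith('r'):
            total += 1
            if line[1:].lower().startswith(prefixes):
                matched += 1
    # A message with no recipients at all is never exempted.
    return total > 0 and matched == total
```

For example, passing exempt=('abuse@', 'postmaster@') would let messages addressed only to those two mailboxes through, while a single ordinary recipient is enough to have the message scanned.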

Install

Be careful with the pip install below, as it will try to install courier-pythonfilter too. If it's already installed and you're in a virtualenv, you don't need sudo.

sudo pip install -r "http://www.tana.it/svn/pyfilters/trunk/requirements.txt"

curl -O "http://www.tana.it/svn/pyfilters/trunk/attachments.py"
python -m compileall -l "attachments.py"
sudo mv "attachments.pyc" "/usr/local/lib/python2.7/dist-packages/pythonfilter"

sudo courierfilter stop
sudo courierfilter start

Check that the /python2.7/ destination directory is correct for your system! If you'd like to consider a pythonic install, please see below.
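One way to find the right destination directory is to ask Python where the pythonfilter package lives. This is a minimal sketch: package_dir is a made-up helper, and it assumes you run it with the same interpreter (and virtualenv, if any) that courierfilter uses.

```python
import importlib
import os


def package_dir(name):
    """Return the directory of an importable package, or None if absent."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    path = getattr(module, '__file__', None)
    if not path:
        # e.g. a namespace package, which has no single directory
        return None
    return os.path.dirname(os.path.abspath(path))
```

package_dir('pythonfilter') should then give the directory to use in place of the /usr/local/lib/python2.7/dist-packages/pythonfilter path above, or None if courier-pythonfilter is not importable from that interpreter.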

Source

#!/usr/bin/python
" attachments -- Courier filter which blocks specified attachment types"
# Copyright (C) 2005-2008  Robert Penz <robert@penz.name>
# hacked (H) 2017 ale
#

# Written for Python 2.7: payloads are handled as byte strings (str).

import sys
from email.message import Message, _unquotevalue
import email.utils
import binascii

# this is libarchive-c
import libarchive

import oletools.olevba
from oletools.mraptor import MacroRaptor
# importing the submodule also makes oletools.rtfobj available below
from oletools import rtfobj

from io import BytesIO

# Extensions.  Assume any extension appears in at most one list.
# Each list has a different treatment.
# Maintain:
# $ i=0; for e in $(sort < temp |uniq ); do printf " '%s'," $e; if [ $((++i % 8)) -eq 0 ]; then printf "\n"; fi; done; printf "\n"

# https://support.google.com/mail/answer/6590?hl=en
# http://www.theverge.com/2017/1/25/14391462/gmail-javascript-block-file-attachments-malware-security
# https://kb.intermedia.net/Article/23567

blocked_extensions = (
 '.acc', '.ade', '.adp', '.asp', '.bat', '.ccs', '.chm', '.class',
 '.cmd', '.com', '.cpl', '.dll', '.dmg', '.drv', '.exe', '.grp',
 '.hlp', '.hta', '.htx', '.ins', '.isp', '.jar', '.je', '.js',
 '.jse', '.lib', '.lnk', '.mde', '.msc', '.msh', '.msh1', '.msh1xml',
 '.msh2', '.msh2xml', '.mshxml', '.msi', '.msp', '.mst', '.ocx', '.ovl',
 '.pcd', '.php', '.php3', '.pif', '.ps1', '.ps1xml', '.ps2', '.ps2xml',
 '.psc1', '.psc2', '.reg', '.sbs', '.scr', '.sct', '.shb', '.shd',
 '.shs', '.sys', '.vb', '.vba', '.vbe', '.vbs', '.vdl', '.vxd',
 '.ws', '.wsc', '.wsf', '.wsh', '.wst')


# extensions supported by VBA_parser, see also
# https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions
# https://datatypes.net/open-ade-files
# https://docs.microsoft.com/en-us/deployoffice/security/block-specific-file-format-types-in-office
# office_extensions = (
#  '.doc', '.dot',   #- Word 97-2003
#  '.docm', '.dotm', #- Word 2007+
#  '.xml',           #- Word 2003 XML
#  '.mht',           #- Word MHT - Single File Web Page / MHTML
#  '.xls',           #- Excel 97-2003
#  '.xlsm', '.xlsb', #- Excel 2007+
#  '.ppt',           #- PowerPoint 97-2003
#  '.pptm', '.ppsm') #- PowerPoint 2007+

office_extensions = (
 '.accda', '.accdb', '.accde', '.accdr', '.accdt', '.ade', '.adn', '.adp',
 '.cdb', '.doc', '.docb', '.docm', '.docx', '.dot', '.dotm', '.dotx',
 '.htm', '.html', '.laccdb', '.ldb', '.maf', '.mam', '.maq', '.mar',
 '.mat', '.mda', '.mdb', '.mde', '.mdf', '.mdn', '.mdt', '.mdw',
 '.mht', '.mhtml', '.ods', '.pot', '.potm', '.potx', '.ppam', '.ppax',
 '.pps', '.ppsm', '.ppsx', '.ppt', '.pptm', '.pptx', '.rtf', '.sldm',
 '.sldx', '.thmx', '.wbk', '.wiz', '.xla', '.xlam', '.xlb', '.xlc',
 '.xlk', '.xll', '.xlm', '.xlmss', '.xls', '.xlsb', '.xlsm', '.xlsx',
 '.xlt', '.xltm', '.xltx', '.xlw')


# extensions implemented as a zip container have a variety of media
# files, but macros are still implemented as OLE containers.
# See 'Heuristic' below.
# https://www.codeproject.com/Articles/15216/Office-2007-bin-file-format
# https://kb.intermedia.net/Article/23567



# https://en.wikipedia.org/wiki/List_of_archive_formats but must be in
# https://github.com/libarchive/libarchive/wiki/ManPageLibarchiveFormats5
# All lowercase, because filenames are lowercased before matching.
archive_extensions = ('.zip', '.tar.gz', '.tgz', '.tar.z', '.tar.bz2',
   '.tbz2', '.tar.lzma', '.tlz', '.7z', '.ace', '.rar')


# TODO: detect documents with TargetMode="External" and DDE, see:
# http://staaldraad.github.io/2017/10/23/msword-field-codes/


def de_comment(field):
   """Parse a header field fragment and remove comments.

   copied from AddrlistClass.getdelimited() in email/_parseaddr.py
   """

   slist = ['']
   quote = False
   pos = 0
   depth = 0
   while pos < len(field):
      if quote:
         quote = False
      elif field[pos] == '(':
         depth += 1
      elif field[pos] == ')':
         depth = max(depth - 1, 0)
         pos += 1
         continue
      elif field[pos] == '\\':
         quote = True
      if depth == 0:
         slist.append(field[pos])
      pos += 1

   return ''.join(slist)

def is_quoted(value):
   """ Check whether a value (string or tuple) is quoted
   """
   if isinstance(value, tuple):
      return value[2].startswith('"')
   else:
      return value.startswith('"')

class Recipients(object):
   def __init__(self, controlFileList=None, *args, **kwargs):
      object.__init__(self, *args, **kwargs)
      self.rcpt_count = 0
      self.rcpt_abuse = 0
      self.relay = False

      # tolerate the default None as well as an empty list
      for cf in controlFileList or []:
         with open(cf) as fp:
            for line in fp:
               if line[0] == 'r':
                  self.rcpt_count = self.rcpt_count + 1
                  if line[1:7].lower() == 'abuse@':
                     self.rcpt_abuse = self.rcpt_abuse + 1
               elif line[0] == 'u':
                  if line[1:9] == 'authsmtp':
                     self.relay = True

   def can_pass(self):
      "Return true if the only recipient(s) are RFC2142 abuse-mailbox(es)"
      return self.rcpt_count > 0 and self.rcpt_abuse == self.rcpt_count


class MyMessage(Message):
   """Email message with comments stripped
   """
   def __init__(self, *args, **kwargs):
      Message.__init__(self, *args, **kwargs)

   def get_filename(self, failobj=None):
      """Return the filename associated with the payload if present.

      The filename is extracted from the Content-Disposition header's
      `filename' parameter.  If that header is missing the `filename'
      parameter, this method falls back to looking for the `name' parameter.
      """
      # changed from original: get the unquoted string
      missing = object()
      filename = self.get_param('filename', missing, 'content-disposition',
         unquote=False)
      if filename is missing:
         filename = self.get_param('name', missing, 'content-type', unquote=False)
      if filename is missing:
         return failobj

      # added to original: non quoted comments are removed
      bare = is_quoted(filename)
      if not bare:
         filename = _unquotevalue(filename)
      filename = email.utils.collapse_rfc2231_value(filename)
      if bare and '(' in filename:
         filename = de_comment(filename)
      # malformed values, e.g. name=3D"blah", we only remove trailing char
      while filename.endswith(('"', "'", '>', ',', ';')):
         filename = filename[0:-1]
      return filename.strip().lower()

def reader_entry(which):
   # debugging hook, e.g.: print 'Entered', which, 'reader'
   pass

def check_message(msg):
   "Return True if any part of msg, or of a message nested in it, must be blocked"
   for part in msg.walk():
      # reader of the current part; only called if content must be examined
      def mail_reader():
         # return part.get_payload(decode=True)
         # copied in order to detect malformed stuff
         reader_entry('mail')
         payload = part.get_payload()
         cte = part.get('content-transfer-encoding', '').lower()
         if cte == 'quoted-printable':
            return binascii.a2b_qp(payload)
         elif cte == 'base64':
            mem = ''
            for line in payload.split():
               ln = line.strip()
               try:
                  mem += binascii.a2b_base64(ln)
               except binascii.Error:
                  # Incorrect padding: keep the leading valid characters
                  # and pad them to a multiple of four
                  clean = ''
                  for c in ln:
                     if not (c.isalnum() or c in '+/'):
                        break
                     clean += c
                  if len(clean) % 4:
                     clean += '===='[0:4 - len(clean) % 4]
                  mem += binascii.a2b_base64(clean)
            return mem
         elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
            return binascii.a2b_uu(payload)
         else: # 7bit, 8bit
            if part.is_multipart():
               return payload[0].as_string()
            else:
               # When is_multipart() returns False, the payload should be a string object.
               # https://docs.python.org/2/library/email.message.html#email.message.Message.is_multipart
               return payload

      # multipart/* are just containers
      if part.get_content_maintype() == 'multipart':
         continue

      if part.get_content_type() == 'message/rfc822':
         inner_msg = email.message_from_string(mail_reader(), _class=MyMessage)
         # a clean inner message must not stop the scan of the remaining parts
         if check_message(inner_msg):
            return True
         continue

      # get_filename() is in MyMessage
      filename = part.get_filename()
      if filename:
         # print part.get_content_type(), filename
         if block_file(filename, mail_reader):
            return True

   return False

def block_ole_file(filename, data):
   "Return True if data holds an embedded OLE container or suspicious macros"
   try:
      parser = oletools.olevba.VBA_Parser(BytesIO(data), data=data, relaxed=True)
      # Heuristic: if an OpenXML contains an OLE container, it is suspicious
      if parser.type == 'OpenXML':
         if len(parser.ole_subfiles) > 0:
            return True
      if parser.detect_vba_macros():
         vba_code_all = ''
         for (subfilename, stream_path, vba_filename, vba_code) in parser.extract_macros():
            vba_code_all += vba_code + '\n'
         mraptor = MacroRaptor(vba_code_all)
         mraptor.scan()
         if mraptor.suspicious:
            return True
   except oletools.olevba.FileOpenError as e:
      sys.stderr.write('attachments FileOpenError: ' + str(e) + '\n')
   return False

def block_file(filename, reader):
   """
      Check if a file should be blocked, either because of its extension
      or its content.  If content must be examined, the reader is called.
      filename must be defined and lower().strip()
      Return True if blocking is deserved.
   """
   # print 'block_file', filename
   if filename.endswith(blocked_extensions):
      return True

   if filename.endswith(archive_extensions):
      # print filename
      try:
         zmem = reader()
         with libarchive.memory_reader(zmem) as archive:
            for entry in archive:
               def archive_reader():
                  reader_entry('archive')
                  mem = ''
                  for block in entry.get_blocks():
                     mem += block
                  return mem

               # member names are normalized like attachment names
               if block_file(entry.pathname.lower().strip(), archive_reader):
                  return True
      except libarchive.exception.ArchiveError as e:
         if e.retcode == libarchive.ffi.ARCHIVE_FATAL:
            # Unrecognized archive format, e.g. rar v5
            return True

   elif filename.endswith(".gz"):
      def gunzip_reader():
         reader_entry('gunzip')
         mem = ''
         count = 0
         with libarchive.memory_reader(reader(),
               format_name='raw', filter_name='gzip') as archive:
            for entry in archive:
               count += 1
               # a raw gzip stream holds exactly one entry of unknown size
               if count != 1 or entry.size is not None:
                  raise ValueError('Invalid gzip format')
               for block in entry.get_blocks():
                  mem += block
         return mem
      # check the name without '.gz' against the uncompressed content
      return block_file(filename[:-3], gunzip_reader)

   elif filename.endswith(office_extensions):
      data = reader()
      if oletools.rtfobj.is_rtf(data, treat_str_as_data=True):
         rtfp = oletools.rtfobj.RtfObjParser(data)
         rtfp.parse()
         for obj in rtfp.objects:
            if obj.is_ole:
               if obj.oledata_size is None:
                  # format_id=TYPE_LINKED?
                  return True
               elif block_ole_file(filename, obj.oledata):
                  return True
            elif obj.is_package:
               return True
      else:
         return block_ole_file(filename, data)
   elif filename.endswith('.eml'):
      msg = email.message_from_string(reader(), _class=MyMessage)
      return check_message(msg)
   return False

def doFilter(bodyFile, controlFileList):
   "Function called by Pythonfilter"
   try:
      rcpts = Recipients(controlFileList)
      if rcpts.can_pass():
         return ''

      with open(bodyFile) as fp:
         msg = email.message_from_file(fp, _class=MyMessage)
      if check_message(msg):
         return "550 Attachment rejected for policy reasons"

   except Exception as e:
      # on any error, log it and let the message through
      sys.stderr.write('attachments ' + type(e).__name__ + ': ' + str(e) + '\n')

   # nothing found --> to the next filter
   return ''


if __name__ == '__main__':
   # For debugging, you can create a file that contains a message
   # body, possibly including attachments.
   # Run this script with the name of that file as an argument,
   # and it'll print either a permanent failure code to indicate
   # that the message would be rejected, or print "(empty string)"
   # to indicate that the remaining filters would be run.
   if len(sys.argv) != 2:
      print("Usage: attachments.py <message_body_file>")
      sys.exit(0)
   result = doFilter(sys.argv[1], [])
   if result == '':
      result = '(empty string)'
   print(result)

GPLv3

Pythonic installation, anyone?

Courier-pythonfilter doesn't seem to support independently written filters. In addition, this one uses exactly the same name as an existing filter, so it has to be re-installed every time courier-pythonfilter is upgraded. Perhaps courier-pythonfilter should define a namespace for the filters.

On another topic, I found it easier to write a can_pass function than to use the courier-pythonfilter configuration. A well-configured framework can provide for an initial filter which just reads the control file and honors whitelisted targets by skipping all subsequent filters. However, I'm not sure all filters deserve the same whitelisting. An alternative approach would be to cache the contents of the control file, so as to share them among filters.
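To make the caching idea concrete, here is a minimal sketch of a shared helper that parses the control files once and hands the same records to whoever asks next. Nothing like this exists in courier-pythonfilter as far as I know. The names read_control_files and recipients are hypothetical, and the sketch assumes all filters run in one process and can import a common module.

```python
# Parsed records, keyed by the tuple of control file paths.
# A real implementation would also have to evict old entries.
_cache = {}


def read_control_files(control_file_list):
    """Parse the control files once and reuse the result on later calls."""
    key = tuple(control_file_list)
    if key not in _cache:
        records = []
        for path in control_file_list:
            with open(path) as fp:
                for line in fp:
                    line = line.rstrip('\n')
                    if line:
                        # The first character is the record type,
                        # e.g. 'r' for a recipient.
                        records.append((line[0], line[1:]))
        _cache[key] = records
    return _cache[key]


def recipients(control_file_list):
    """Return the recipient addresses found in the control files."""
    return [value for kind, value in read_control_files(control_file_list)
            if kind == 'r']
```

Each filter could then apply its own whitelisting policy to recipients(controlFileList) without reading the files again, which leaves room for filters that deserve different exemptions.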

Besides the Courier-users mailing list, this page also supports Hypothesis annotations, provided you allow their JavaScript. You might have noticed the icons in the upper right corner. Please register on their server in order to write comments.