Remove script tags from HTML content

codestyles 2021. 1. 5. 08:08

I am using HTML Purifier (http://htmlpurifier.org/).

I only want to remove <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Because this question is tagged with regex, I'm going to answer with a poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

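However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually. In some cases it is still useful for quickly fixing up some markup, and as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.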
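To make the "it will break eventually" point concrete, here is a small sketch of two inputs that get past the pattern above. The helper name stripScriptsWithRegex is made up for this illustration; the pattern itself is the one from the answer.

```php
<?php
// Hypothetical helper: wraps the regex from the answer above for illustration.
function stripScriptsWithRegex(string $html): string
{
    $result = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);
    return $result ?? $html; // preg_replace() returns null on error
}

// 1. Browsers tolerate whitespace before the closing ">", the pattern does not,
//    so this input comes back unchanged.
$spaced      = '<script>alert(1)</script >';
$afterSpaced = stripScriptsWithRegex($spaced);

// 2. Removing the inner tag pairs splices the outer fragments together,
//    so the result is '<script>alert(1)</script>'.
$nested      = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';
$afterNested = stripScriptsWithRegex($nested);
```

Neither case is exotic, which is why the regex is best kept for markup you already trust.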

A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable, and (almost) safe it is to do the same thing:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

foreach ($remove as $item)
{
  $item->parentNode->removeChild($item); 
}

$html = $dom->saveHTML();

I have removed the HTML intentionally because even this can break.
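One practical wrinkle with the DOMDocument approach: loadHTML() expects a whole document, so a fragment comes back wrapped in a doctype plus <html> and <body>, and sloppy markup triggers warnings. The sketch below is one possible way around that, not part of the original answers. The function name stripScriptsFromFragment and the wrapper <div> are assumptions made for this example, and the sample input is kept to ASCII because loadHTML() does not assume UTF-8 unless the markup declares it.

```php
<?php
// Hypothetical helper (not from the original answers): strips <script> from an HTML fragment.
function stripScriptsFromFragment(string $fragment): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // The wrapper <div> gives the parser a single root element; the flags ask
    // libxml not to add <html>/<body> or a doctype around it.
    $dom->loadHTML('<div>' . $fragment . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list to an array before removing anything.
    foreach (iterator_to_array($dom->getElementsByTagName('script')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Serialize only the children of the wrapper, so the <div> itself is dropped.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$cleanFragment = stripScriptsFromFragment('<p style="color:red">Hi</p><script>alert(1)</script>');
```

The inline style on the <p> survives, which matches what the question asks for.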

Use the PHP DOMDocument parser.

$doc = new DOMDocument();

// load the HTML string we want to strip
$doc->loadHTML($html);

// get all the script tags
$script_tags = $doc->getElementsByTagName('script');

// the node list is live, so remove from the end to keep the indexes valid
for ($i = $script_tags->length - 1; $i >= 0; $i--) {
  $tag = $script_tags->item($i);
  $tag->parentNode->removeChild($tag);
}

// get the HTML string back
$no_script_html_string = $doc->saveHTML();

This worked for me using the following HTML document:

<!doctype html>
<html>
    <head>
        <meta charset="utf-8">
        <title>
            hey
        </title>
        <script>
            alert("hello");
        </script>
    </head>
    <body>
        hey
    </body>
</html>

Just bear in mind that the DOMDocument parser requires PHP 5 or greater.

$html = <<<HTML
...
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags_to_remove = array('script','style','iframe','link');
foreach($tags_to_remove as $tag){
    // copy the live node list first; removing while iterating it skips nodes
    $element = iterator_to_array($dom->getElementsByTagName($tag));
    foreach($element as $item){
        $item->parentNode->removeChild($item);
    }
}
$html = $dom->saveHTML();
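A detail worth keeping in mind with any of the DOMDocument snippets: getElementsByTagName() returns a live list, so it shrinks as nodes are removed. A minimal sketch of one safe removal pattern follows; the sample markup is made up for the illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<html><body><script>a()</script><script>b()</script><p>kept</p></body></html>');

$scripts = $dom->getElementsByTagName('script');

// The list is live: removing a node shrinks it immediately, so keep taking
// the first remaining item until none are left.
while ($scripts->length > 0) {
    $node = $scripts->item(0);
    $node->parentNode->removeChild($node);
}

$remaining = $dom->getElementsByTagName('script')->length; // 0
$html      = $dom->saveHTML();
```

Copying the list with iterator_to_array() before looping should work just as well; the point is only to avoid walking the live list forward while deleting from it.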

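I had been struggling with this question. I discovered that you really only need one function: explode('>', $html). The single common denominator of every tag is < and >. After that it is usually quote marks ("). Once you find the common denominator, you can extract information easily. This is what I came up with: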

$html = file_get_contents('http://some_page.html');

$h = explode('>', $html);
$counter = null; // index of the piece that holds the opening <script tag

foreach ($h as $k => $v) {
    $v = trim($v); // clean it up a bit

    if (preg_match('/^<script\b/i', $v)) { // only matches when the piece starts with the tag
        $counter = $k; // match opening tag and start counter for backtrace
    } elseif ($counter !== null && preg_match('/<\/script$/i', $v)) {
        // backtrace and drop everything from the opening tag through the closing tag
        for ($i = $counter; $i <= $k; $i++) {
            unset($h[$i]);
        }
        $counter = null;
    }
}

$html = implode('>', $h); // all scripts stripped

echo $html;

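Since script tags are never nested, this really seems to work only for script tags. Of course, you could easily add more code that performs the same check and collects nested tags.

I call it accordion coding. implode() and explode() are the easiest way to get the logic flowing when you have a common denominator.

An example that modifies ctf0's answer. It runs preg_replace only once, but also checks for errors and blocks the character codes for the forward slash.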

$str = '<script> var a - 1; <&#47;script>'; 

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';
$replace = preg_replace($pattern, '', $str); 
return ($replace !== null)? $replace : $str;

If you are using PHP 7, you can use the null coalescing operator to simplify it even more.

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius'; 
return (preg_replace($pattern, '', $str) ?? $str);
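For context on when that fallback actually fires: preg_replace() returns null only when PCRE itself fails, for example on invalid UTF-8 while the /u modifier is set. A rough sketch, with stripOrKeep as a made-up wrapper name and the pattern copied from the snippet above:

```php
<?php
// Hypothetical wrapper for illustration: returns the input unchanged when PCRE fails.
function stripOrKeep(string $pattern, string $subject): string
{
    // null signals an error, so fall back to the untouched subject
    return preg_replace($pattern, '', $subject) ?? $subject;
}

$pattern = '/(script.*?(?:\/|&#47;|&#x0002F;)script)/ius';

// Normal case: the span from "script" to "/script" is removed, the arrows stay.
$ok = stripOrKeep($pattern, 'a<script>x()</script>b');

// Error case: a lone 0xFF byte is not valid UTF-8, so the /u pattern fails.
$broken    = stripOrKeep($pattern, "a\xFFb");
$errorCode = preg_last_error(); // PREG_BAD_UTF8_ERROR after the failed call
```

Checking preg_last_error() is optional, but it helps tell "nothing matched" apart from "the engine gave up".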

I would use BeautifulSoup if it's available. Makes this sort of thing very easy.

Don't try to do it with regexps. That way lies madness.

Shorter:

$html = preg_replace("/<script.*?\/script>/s", "", $html);

Regex operations can go wrong, so it's safer to do it like this:

$html = preg_replace("/<script.*?\/script>/s", "", $html) ?: $html;

That way, when the "accident" happens, we get the original $html back instead of an empty value.
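One caveat with the short ternary: it falls back on any falsy result, not only on null. The sketch below uses an input that is nothing but a script block (an assumption chosen to show the edge case) and compares ?: with ??.

```php
<?php
$pattern = "/<script.*?\/script>/s";
$html    = '<script>alert(1)</script>'; // input that is nothing but a script block

// Succeeds and removes everything, leaving an empty string.
$stripped = preg_replace($pattern, '', $html);

$withShortTernary = $stripped ?: $html; // '' is falsy, so the original script comes back
$withCoalesce     = $stripped ?? $html; // only null triggers the fallback, so '' is kept
```

So if the goal is only to guard against a null result, ?? looks like the safer choice of the two.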

This is a merge of both ClandestineCoder's and Binh WPO's answers.

The problem with the script tag arrows is that they can have more than one variant,

e.g. (< = &lt; = &amp;lt;) & (> = &gt; = &amp;gt;)

So instead of creating a pattern array with a bazillion variants, IMHO a better solution would be:

return preg_replace('/script.*?\/script/ius', '', $text)
       ? preg_replace('/script.*?\/script/ius', '', $text)
       : $text;

This removes anything that looks like script.../script regardless of the arrow code/variant, and you can test it here: https://regex101.com/r/lK6vS8/1
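Because the pattern no longer looks at the arrows at all, it is worth seeing what it leaves behind and what else it catches. A small sketch with two made-up inputs:

```php
<?php
$pattern = '/script.*?\/script/ius';

// Entity-encoded tags are caught, but the arrow entities themselves stay behind:
// the result is '&lt;&gt;'.
$encoded = preg_replace($pattern, '', '&lt;script&gt;alert(1)&lt;/script&gt;');

// Ordinary prose that merely contains "script" followed later by "/script"
// is cut as well: the result is 'See the tran for details'.
$prose = preg_replace($pattern, '', 'See the transcript at docs/script for details');
```

Whether that trade-off is acceptable depends on the content; for text that may legitimately mention paths or the word "script", it probably is not.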

This is a simplified variant of Dejan Marjanovic's answer:

function removeTags($html, $tag) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $item) {
        $item->parentNode->removeChild($item);
    }
    return $dom->saveHTML();
}

It can be used to remove any kind of tag, including <script>:

$scriptlessHtml = removeTags($html, 'script');

Use the str_replace function to replace them with an empty string or something else.

$query = '<script>console.log("I should be banned")</script>';

$badChar = array('<script>','</script>');
$query = str_replace($badChar, '', $query);

echo $query; 
//this echoes console.log("I should be banned")
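Note that str_replace() matches the exact strings only and leaves the script body in place as text. A quick sketch of two inputs (both invented for the example) that it does not touch the way one might hope:

```php
<?php
$badChar = array('<script>', '</script>');

// An attribute is enough to slip past the exact match on the opening tag:
// only the closing tag is removed, leaving '<script src="evil.js">'.
$withAttr = str_replace($badChar, '', '<script src="evil.js"></script>');

// str_replace() is case-sensitive, so upper-case tags come back unchanged.
$upperCase = str_replace($badChar, '', '<SCRIPT>alert(1)</SCRIPT>');
```

This approach is therefore only reasonable for input whose exact shape is known in advance.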

A simple way, by manipulating the string.

$str = stripStr($str, '<script', '</script>');

function stripStr($str, $ini, $fin)
{
    while(($pos = mb_stripos($str, $ini)) !== false)
    {
        $aux = mb_substr($str, $pos + mb_strlen($ini));
        $str = mb_substr($str, 0, $pos).mb_substr($aux, mb_stripos($aux, $fin) + mb_strlen($fin));
    }

    return $str;
}
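One thing the function above does not handle explicitly is an opening marker with no closing marker after it. The sketch below is a guarded variant under that assumption; the name stripStrGuarded and the choice to drop everything from the unclosed marker onward are mine, not the original author's.

```php
<?php
// Hypothetical variant of stripStr(): handles an opening marker that is never closed.
function stripStrGuarded(string $str, string $ini, string $fin): string
{
    while (($pos = mb_stripos($str, $ini)) !== false) {
        // Look for the closing marker only after the opening one.
        $end = mb_stripos($str, $fin, $pos + mb_strlen($ini));
        if ($end === false) {
            // No closing marker: drop everything from the opening marker onward.
            return mb_substr($str, 0, $pos);
        }
        // Keep the text before the opening marker and after the closing marker.
        $str = mb_substr($str, 0, $pos) . mb_substr($str, $end + mb_strlen($fin));
    }
    return $str;
}

$clean    = stripStrGuarded('<p>a</p><script src="x">alert(1)</script><p>b</p>', '<script', '</script>');
$unclosed = stripStrGuarded('<p>a</p><script>alert(1)', '<script', '</script>');
```

Here $clean ends up as '<p>a</p><p>b</p>' and $unclosed as '<p>a</p>'. Like the original, this needs the mbstring extension.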

Reference URL: https://stackoverflow.com/questions/7130867/remove-script-tag-from-html-content
